Consider a parametrized family of general hidden Markov models,
where both the observed and unobserved components take values 
in a complete separable metric space. We prove that the maximum 
likelihood estimator (MLE) of the parameter is strongly consistent 
under a rather minimal set of assumptions. As special cases of our 
main result, we obtain consistency in a large class of nonlinear state 
space models, as well as general results on linear Gaussian state space 
models and finite state models. 

A novel aspect of our approach is an information-theoretic tech- 
nique for proving identifiability, which does not require an explicit 
representation for the relative entropy rate. Our method of proof 
could therefore form a foundation for the investigation of MLE con- 
sistency in more general dependent and non-Markovian time series. 
Also of independent interest is a general concentration inequality for
$V$-uniformly ergodic Markov chains.

1. Introduction. A hidden Markov model (HMM) is a bivariate stochastic
process $(X_k, Y_k)_{k \ge 0}$, where $(X_k)_{k \ge 0}$ is a Markov chain (often
referred to as the state sequence) in a state space $\mathsf{X}$ and,
conditionally on $(X_k)_{k \ge 0}$, $(Y_k)_{k \ge 0}$ is a sequence of independent
random variables in a state space $\mathsf{Y}$ such that the conditional
distribution of $Y_k$ given the state sequence depends on $X_k$ only. The key
feature of HMM is that the state sequence $(X_k)_{k \ge 0}$ is not observable,
so that statistical inference has to be carried out by means of the
observations $(Y_k)_{k \ge 0}$ only. Such problems are far from straightforward
because the observation process $(Y_k)_{k \ge 0}$ is generally a dependent,
non-Markovian time series [even though the bivariate process
$(X_k, Y_k)_{k \ge 0}$ is itself Markovian]. HMM appear in a large variety of
scientific disciplines including financial econometrics [17, 25], biology [7],
speech recognition [19], neurophysiology [11], etc., and the statistical
inference for such models is therefore of significant practical importance [6].

In this paper, we will consider a parametrized family of HMM with
parameter space $\Theta$. For each parameter $\theta \in \Theta$, the dynamics of
the HMM is specified by the transition kernel $Q_\theta$ of the Markov process
$(X_k)_{k \ge 0}$, and by the conditional distribution $G_\theta$ of the
observation $Y_k$ given the signal $X_k$. For example, the state and observation
sequences may be generated according to a nonlinear dynamical system (which
defines implicitly $Q_\theta$ and $G_\theta$) of the form

$$X_k = a_\theta(X_{k-1}, W_k),$$

$$Y_k = b_\theta(X_k, V_k),$$

where $a_\theta$ and $b_\theta$ are (nonlinear) functions and $(W_k)_{k \ge 1}$,
$(V_k)_{k \ge 0}$ are independent sequences of i.i.d. random variables which are
independent of $X_0$.
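
As an illustration of the dynamical system above, the following minimal sketch simulates one path of such a model. The Gaussian noise, the function name `simulate_hmm`, and the stochastic-volatility-type choice of $a_\theta$, $b_\theta$ with parameters `phi`, `sigma`, `beta` are assumptions made for this example only; they are not taken from the paper.

```python
import math
import random


def simulate_hmm(a, b, x0, n, rng):
    """Simulate X_k = a(X_{k-1}, W_k), Y_k = b(X_k, V_k) for k = 0..n.

    W_k and V_k are i.i.d. standard Gaussian here (an illustrative assumption).
    Returns (states, observations), each of length n + 1.
    """
    states, observations = [], []
    x = x0
    for k in range(n + 1):
        if k > 0:
            x = a(x, rng.gauss(0.0, 1.0))  # state update uses W_k for k >= 1
        states.append(x)
        observations.append(b(x, rng.gauss(0.0, 1.0)))  # observation uses V_k
    return states, observations


# Hypothetical stochastic-volatility-type model with theta = (phi, sigma, beta).
phi, sigma, beta = 0.9, 0.3, 0.5


def a_theta(x, w):
    return phi * x + sigma * w


def b_theta(x, v):
    return beta * math.exp(x / 2.0) * v


xs, ys = simulate_hmm(a_theta, b_theta, x0=0.0, n=200, rng=random.Random(0))
```

Only `ys` would be available to the statistician; `xs` plays the role of the hidden state sequence.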

Throughout the paper, we fix a distinguished element $\theta^* \in \Theta$. We
will always presume that the kernel $Q_{\theta^*}$ possesses a unique invariant
probability measure $\pi_{\theta^*}$, and we denote by $\bar{\mathbb{P}}_{\theta^*}$
and $\bar{\mathbb{E}}_{\theta^*}$ the law and associated expectation of the
stationary HMM with parameter $\theta^*$ (we refer to Section 2.1 for detailed
definitions of these quantities). In the setting of this paper, we have access
to a single observation path of the process $(Y_k)_{k \ge 0}$ sampled from the
distribution $\bar{\mathbb{P}}_{\theta^*}$. Thus, $\theta^*$ is interpreted as the
true parameter value, which is not known a priori. Our basic problem is to form
a consistent estimate of $\theta^*$ on the basis of the observations
$(Y_k)_{k \ge 0}$ only, that is, without access to the hidden process
$(X_k)_{k \ge 0}$. This will be accomplished by means of the maximum likelihood
method.

The maximum likelihood estimator (MLE) is one of the backbones of 
statistics, and common wisdom has it that the MLE should be, except in 
"atypical" cases, consistent in the sense that it converges to the true param- 
eter value as the number of observations tends to infinity. The purpose of 
this paper is to show that this is indeed the case for HMM under a rather 
minimal set of assumptions. Our main result substantially generalizes pre- 
viously known consistency results for HMM, and can be applied to many 
models of practical interest. 

1.1. Previous work. The study of asymptotic properties of the MLE in 
HMM was initiated in the seminal work of Baum and Petrie [3, 28] in the 
1960s. In these papers, the state space $\mathsf{X}$ and the observation space
$\mathsf{Y}$ were both presumed to be finite sets. More than two decades later,
Leroux [23] proved consistency for the case that $\mathsf{X}$ is a finite set and
$\mathsf{Y}$ is a general state space. The consistency of the MLE in more general
HMM has subsequently been investigated in a series of contributions
[8, 9, 14, 21, 22] using a variety of methods. However, all these results require
very restrictive assumptions on the underlying model, such as uniform positivity
of the transition densities, which are rarely satisfied in applications
(particularly in the case of a noncompact state space $\mathsf{X}$). A general
consistency result for HMM has hitherto remained lacking.

Though the consistency results above differ in the details of their proofs,
all proofs have a common thread which serves also as the starting point for
this paper. Let us therefore recall the basic approach for proving consistency
of the MLE. Denote by $p^\nu(y_0^n; \theta)$ the likelihood of the observations
$Y_0^n$ for the HMM with parameter $\theta \in \Theta$ and initial measure
$X_0 \sim \nu$. The first step of the proof aims to establish that for any
$\theta \in \Theta$, there is a constant $H(\theta^*, \theta)$ such that

$$\lim_{n \to \infty} n^{-1} \log p^\nu(Y_0^n; \theta)
  = \lim_{n \to \infty} n^{-1} \bar{\mathbb{E}}_{\theta^*}[\log p^\nu(Y_0^n; \theta)]
  = H(\theta^*, \theta), \qquad \bar{\mathbb{P}}_{\theta^*}\text{-a.s.}$$

For $\theta = \theta^*$, this convergence follows from the generalized
Shannon-Breiman-McMillan theorem [2], but for $\theta \ne \theta^*$ the existence
of the limit is far from obvious. Now set
$K(\theta^*, \theta) = H(\theta^*, \theta^*) - H(\theta^*, \theta)$. Then
$K(\theta^*, \theta) \ge 0$ is the relative entropy rate between the observation
laws of the parameters $\theta^*$ and $\theta$, respectively. The second step of
the proof aims to establish identifiability, that is, that $K(\theta^*, \theta)$
is minimized only at those parameters $\theta$ that are equivalent to $\theta^*$
(in the sense that they give rise to the same stationary observation law).
Finally, the third step of the proof aims to prove that the maximizer of the
likelihood $\theta \mapsto p^\nu(Y_0^n; \theta)$ converges
$\bar{\mathbb{P}}_{\theta^*}$-a.s. to the maximizer of $H(\theta^*, \theta)$, that
is, to the minimizer of $K(\theta^*, \theta)$. Together, these three steps imply
consistency.

Let us note that one could write the likelihood as

$$n^{-1} \log p^\nu(Y_0^n; \theta)
  = \frac{1}{n} \sum_{k=0}^{n} \log p^\nu(Y_k \mid Y_0^{k-1}; \theta),$$

where $p^\nu(Y_k \mid Y_0^{k-1}; \theta)$ denotes the conditional density of $Y_k$
given $Y_0^{k-1}$ under the parameter $\theta$ (i.e., the one-step predictor). If
the limit of $p^\nu(Y_1 \mid Y_{-n}^0; \theta)$ as $n \to \infty$ can be shown to
exist $\bar{\mathbb{P}}_{\theta^*}$-a.s., existence of the relative entropy rate
follows from the ergodic theorem and yields the explicit representation
$H(\theta^*, \theta) = \bar{\mathbb{E}}_{\theta^*}[\log p^\nu(Y_1 \mid Y_{-\infty}^0; \theta)]$.
Such an approach was used in [3, 9]. Alternatively, the predictive distribution
$p^\nu(Y_k \mid Y_0^{k-1}; \theta)$ can be expressed in terms of a measure-valued
Markov chain (the prediction filter), so that existence of the relative entropy
rate, as well as an explicit representation for $H(\theta^*, \theta)$, follows
from the ergodic theorem for Markov chains if the prediction filter can be shown
to be ergodic. This approach was used in [8, 21, 22]. In [23], the existence of
the relative entropy rate is established by means of Kingman's subadditive
ergodic theorem (the same approach is used indirectly in [28], which invokes the
Furstenberg-Kesten theory of random matrix products). After some additional
work, an explicit representation of $H(\theta^*, \theta)$ is again obtained.
However, as noted in [23], page 136, the latter is surprisingly difficult, as
Kingman's ergodic theorem does not directly yield a representation of the limit
as an expectation.
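
To make the role of the one-step predictor concrete, here is a minimal sketch for the finite-state, finite-observation case. It accumulates the log-likelihood as a sum of log predictive probabilities via the prediction filter, and checks the result against a brute-force sum over hidden paths. The function names and the numerical values of `nu`, `Q`, `G` are illustrative assumptions, not objects from the paper.

```python
import math
from itertools import product


def log_likelihood(nu, Q, G, ys):
    """Return log p^nu(y_0^n; theta) for a finite-state HMM.

    nu[i]   : initial probability of state i
    Q[i][j] : transition probability i -> j
    G[i][y] : probability of observing symbol y in state i
    The sum accumulates log p(y_k | y_0^{k-1}), the one-step predictor terms.
    """
    d = len(nu)
    pred = list(nu)  # prediction filter: law of X_k given y_0^{k-1}
    total = 0.0
    for y in ys:
        joint = [pred[i] * G[i][y] for i in range(d)]
        c = sum(joint)  # p(y_k | y_0^{k-1})
        total += math.log(c)
        filt = [v / c for v in joint]  # law of X_k given y_0^k
        pred = [sum(filt[i] * Q[i][j] for i in range(d)) for j in range(d)]
    return total


def brute_force_likelihood(nu, Q, G, ys):
    """Sum the joint probability over every hidden path (tiny examples only)."""
    d, total = len(nu), 0.0
    for path in product(range(d), repeat=len(ys)):
        p = nu[path[0]] * G[path[0]][ys[0]]
        for k in range(1, len(ys)):
            p *= Q[path[k - 1]][path[k]] * G[path[k]][ys[k]]
        total += p
    return total


nu = [0.5, 0.5]
Q = [[0.9, 0.1], [0.2, 0.8]]
G = [[0.7, 0.3], [0.1, 0.9]]
ys = [0, 1, 1, 0, 1]
```

The recursion costs a number of operations linear in the length of `ys`, whereas the brute-force sum grows exponentially; the latter is included only as a correctness check.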

Though the proofs use different techniques, all the results above rely heavily
on the explicit representation of $H(\theta^*, \theta)$ in order to establish
identifiability. This has proven to be one of the main difficulties in developing
consistency results for more general HMM. For example, an attempt in [14] to
generalize the approach of [23] failed to establish such a representation, and 
therefore to establish consistency except in a special example. Once identi- 
fiability has been established, standard techniques (such as Wald's method) 
can be used to show convergence of the maximizer of the likelihood, com- 
pleting the proof. 

For completeness, we note that a recent attempt [12] to prove consistency 
of the MLE for general HMM contains very serious problems in the proof [18] 
(not addressed in [13]), and therefore fails to establish the claimed results. 

1.2. Approach of this paper. In this paper, we prove consistency of the
MLE for general HMM under rather mild assumptions. Though our proof
follows broadly the general approach described above, our approach differs
from previous work in two key aspects. First, we note that it is not necessary
to establish existence of the relative entropy rate. Indeed, rather than
attempting to prove the existence of a limiting contrast function

$$\lim_{n \to \infty} n^{-1} \log p^\nu(Y_0^n; \theta) = H(\theta^*, \theta),
  \qquad \bar{\mathbb{P}}_{\theta^*}\text{-a.s.},$$

which must then be shown to be identifiable in the sense that
$H(\theta^*, \theta) < H(\theta^*, \theta^*)$ for parameters $\theta$ not
equivalent to $\theta^*$, it suffices to show directly that

$$\limsup_{n \to \infty} n^{-1} \log p^\nu(Y_0^n; \theta) < H(\theta^*, \theta^*),
  \qquad \bar{\mathbb{P}}_{\theta^*}\text{-a.s.}$$

[note that the existence of $H(\theta^*, \theta^*)$ is guaranteed by the
Shannon-Breiman-McMillan theorem, and therefore poses little difficulty in the
proof]. This simple observation implies that it suffices to obtain a convenient
upper bound for $p^\nu(Y_0^n; \theta)$, which we accomplish by introducing the
assumption that some iterate $Q_\theta^l$ of the transition kernel of the state
sequence possesses a bounded density with respect to a $\sigma$-finite reference
measure $\lambda$.

Second, and perhaps more importantly, we avoid entirely the need to
obtain an explicit representation for the limiting contrast function
$H(\theta^*, \theta)$ which played a key role in all previous work. Instead, we
develop in Section 4.2 a surprisingly powerful information-theoretic device
which may be used to prove identifiability in a very general setting (see [26]
for related ideas), and is not specific to HMM. This technique yields the
following: in order to establish that the normalized relative entropy is bounded
away from zero, that is,

$$\liminf_{n \to \infty} \bar{\mathbb{E}}_{\theta^*}\left[ n^{-1} \log
  \frac{\bar{p}(Y_0^n; \theta^*)}{p^\nu(Y_0^n; \theta)} \right] > 0,$$

it suffices to show that there is a sequence of sets $(A_n)_{n \ge 0}$ such that

$$\liminf_{n \to \infty} \bar{\mathbb{P}}_{\theta^*}(Y_0^n \in A_n) > 0, \qquad
  \limsup_{n \to \infty} n^{-1} \log \mathbb{P}^\nu_\theta(Y_0^n \in A_n) < 0$$

[here $\mathbb{P}^\nu_\theta$ is the law of the HMM with parameter $\theta$ and
initial measure $\nu$, while $\bar{p}(y_0^n; \theta^*)$ denotes the likelihood of
$Y_0^n$ under $\bar{\mathbb{P}}_{\theta^*}$]. It is rather straightforward to find
such a sequence of sets, provided the law of the observations $(Y_k)_{k \ge 0}$
is ergodic under $\bar{\mathbb{P}}_{\theta^*}$ and satisfies an elementary large
deviations property under $\mathbb{P}^\nu_\theta$. These properties are readily
established in a very general setting. In particular, we will show (Section 5)
that any geometrically ergodic state sequence gives rise to the requisite large
deviations property, so that our main result can be applied immediately to a
large class of models of practical interest. (Let us note, however, that
ergodicity of $\mathbb{P}^\nu_\theta$ is not necessary; see Section 3.2.)
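
One standard way to see why such sets suffice is a two-point data-processing bound for the relative entropy. The sketch below is offered only as an illustration and is not claimed to be the exact argument of Section 4.2; the symbols $P_n$ and $Q_n$, denoting the laws of $Y_0^n$ under $\bar{\mathbb{P}}_{\theta^*}$ and $\mathbb{P}^\nu_\theta$, are introduced here for convenience.

```latex
% For probability measures P, Q and any event A, coarsening to {A, A^c}
% can only decrease relative entropy, and the binary entropy is at most log 2:
\mathrm{KL}(P \,\|\, Q)
  \;\ge\; P(A)\log\frac{P(A)}{Q(A)} + P(A^c)\log\frac{P(A^c)}{Q(A^c)}
  \;\ge\; -P(A)\log Q(A) - \log 2 .

% Apply this with P = P_n, Q = Q_n, A = A_n and divide by n:
\frac{1}{n}\,\mathrm{KL}(P_n \,\|\, Q_n)
  \;\ge\; P_n(A_n)\,\Bigl(-\tfrac{1}{n}\log Q_n(A_n)\Bigr) - \frac{\log 2}{n} .

% If liminf P_n(A_n) > 0 and limsup n^{-1} log Q_n(A_n) < 0, both factors of
% the product are bounded below by positive constants for all large n, so
\liminf_{n\to\infty} \frac{1}{n}\,\mathrm{KL}(P_n \,\|\, Q_n) \;>\; 0 .
```

The left-hand side is exactly the normalized relative entropy displayed above, since $\mathrm{KL}(P_n \,\|\, Q_n)$ is the $\bar{\mathbb{P}}_{\theta^*}$-expectation of the log-likelihood ratio.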

Of course, there are some complications. Rather than investigating the
likelihood function $p^\nu(Y_0^n; \theta)$ directly, the proof of our main result
relies in an essential manner on the asymptotics of the process
$p^\lambda(Y_0^n; \theta)$ where $\lambda$ is the reference measure defined above.
The latter process plays a special role in our proofs due to the fact that it
satisfies a certain submultiplicativity property; this allows us to upper bound
$n^{-1} \log p^\nu(Y_0^n; \theta)$ by a time average, which possesses an almost
sure limit by Birkhoff's ergodic theorem (see the proof of Theorem 1 below for
further details). As $\lambda$ is typically only $\sigma$-finite, however, it is
not immediately obvious that the problem is well-posed. Nonetheless, we will see
that these complications can be resolved, provided that the HMM is sufficiently
"observable" so that the improper likelihood $p^\lambda(Y_0^n; \theta)$ is well
defined for sufficiently large $n$ (under mild integrability conditions). As is
demonstrated by the examples in Section 3, this is the case in a wide variety of
applications.

Finally, let us note that the techniques used in the proof of our main result 
appear to be quite general. Though we have restricted our attention in this 
paper to the case of HMM, these techniques could form the foundation for 
consistency proofs in other dependent and non-Markovian time series mod- 
els (such as, e.g., the autoregressive setting of [9]), which share many of the 
difficulties of statistical inference in hidden Markov models. Other asymptotic
properties of the MLE, such as asymptotic normality, merit further
investigation.

1.3. Organization of the paper. The remainder of the paper is organized
as follows. In Section 2, we first introduce the setting and notation that are
used throughout the paper. Then, we state our main assumptions and results.
In Section 3, our main result is used to establish consistency in three
general classes of models: linear Gaussian state space models, finite state
models, and nonlinear state space models of the vector ARCH type (this
includes the stochastic volatility model and many other models of interest
in time series analysis and financial econometrics). Section 4 is devoted
to the proof of our main result. Finally, Section 5 is devoted to the proof
of the fact that geometrically ergodic models satisfy the large deviations
property needed for identifiability. In particular, we prove in Section 5.2
a general Azuma-Hoeffding type concentration inequality for $V$-uniformly
ergodic Markov chains, which is of independent interest.

2. Assumptions and main results. 

2.1. Canonical setup and notation. We fix the following spaces throughout:

• $\mathsf{X}$ is a Polish space endowed with its Borel $\sigma$-field $\mathcal{X}$.

• $\mathsf{Y}$ is a Polish space endowed with its Borel $\sigma$-field $\mathcal{Y}$.

• $\Theta$ is a compact metric space endowed with its Borel $\sigma$-field $\mathcal{H}$.

$\mathsf{X}$ is the state space of the hidden Markov process, $\mathsf{Y}$ is the
state space of the observations, and $\Theta$ is the parameter space of our
model. We furthermore assume that $\Theta$ is endowed with a given equivalence
relation$^2$ $\sim$, and denote the equivalence class of $\theta \in \Theta$ as
$[\theta] := \{\theta' \in \Theta : \theta' \sim \theta\}$.

Our model is defined as follows: we are given a transition kernel
$Q : \Theta \times \mathsf{X} \times \mathcal{X} \to [0, 1]$, a positive
$\sigma$-finite measure $\mu$ on $(\mathsf{Y}, \mathcal{Y})$, and a measurable
function $g : \Theta \times \mathsf{X} \times \mathsf{Y} \to \mathbb{R}_+$ such
that $\int g_\theta(x, y)\, \mu(dy) = 1$ for all $\theta, x$. For each
$\theta \in \Theta$, we can define the transition kernel $T_\theta$ on
$(\mathsf{X} \times \mathsf{Y}, \mathcal{X} \otimes \mathcal{Y})$ as

$$T_\theta[(x, y), C] := \iint \mathbf{1}_C(x', y')\, g_\theta(x', y')\,
  \mu(dy')\, Q_\theta(x, dx').$$

$^2$ This is meant here in the broad sense, that is, $\sim$ is a binary relation
on $\Theta$ indicating which elements $\theta \in \Theta$ should be viewed as
"equivalent." We do not require $\sim$ to be transitive.

It should be emphasized that in the setting of this paper, the equivalence
relation $\sim$ is presumed to be given as part of the model specification,
rather than being defined in terms of the model: the statistician may choose up
to which equivalence she wishes to estimate the true parameter generating the
observations. One assumption of our main result [assumption (A6) below] then
requires that parameters $\theta, \theta'$ that are not equivalent, denoted
$\theta \not\sim \theta'$, give rise to observation laws that are distinguishable
in a suitable sense. In many cases, there is a natural equivalence relation
which ensures that this is the case; see Section 2.3 below.
We will work on the measurable space $(\Omega, \mathcal{F})$ where
$\Omega = (\mathsf{X} \times \mathsf{Y})^{\mathbb{N}}$,
$\mathcal{F} = (\mathcal{X} \otimes \mathcal{Y})^{\otimes \mathbb{N}}$, and the
canonical coordinate process is denoted as $(X_k, Y_k)_{k \ge 0}$. For each
$\theta \in \Theta$ and probability measure $\nu$ on $(\mathsf{X}, \mathcal{X})$,
we define $\mathbb{P}^\nu_\theta$ to be the probability measure on
$(\Omega, \mathcal{F})$ such that $(X_k, Y_k)_{k \ge 0}$ is a time homogeneous
Markov process with initial measure
$\mathbb{P}^\nu_\theta((X_0, Y_0) \in C) := \int \mathbf{1}_C(x, y)\, g_\theta(x, y)\, \mu(dy)\, \nu(dx)$
and transition kernel $T_\theta$. Denote as $\mathbb{E}^\nu_\theta$ the
expectation with respect to $\mathbb{P}^\nu_\theta$, and denote as
$\mathbb{P}^{\nu, Y}_\theta$ the marginal of the probability measure
$\mathbb{P}^\nu_\theta$ on $(\mathsf{Y}^{\mathbb{N}}, \mathcal{Y}^{\otimes \mathbb{N}})$.

Throughout the paper, we fix a distinguished element $\theta^* \in \Theta$. We
will always presume that the kernel $Q_{\theta^*}$ possesses a unique invariant
probability measure $\pi_{\theta^*}$ on $(\mathsf{X}, \mathcal{X})$ [this follows
from assumption (A1) below]. For ease of notation, we will write
$\bar{\mathbb{P}}_{\theta^*}, \bar{\mathbb{E}}_{\theta^*}, \bar{\mathbb{P}}^Y_{\theta^*}$
instead of
$\mathbb{P}^{\pi_{\theta^*}}_{\theta^*}, \mathbb{E}^{\pi_{\theta^*}}_{\theta^*}, \mathbb{P}^{\pi_{\theta^*}, Y}_{\theta^*}$.
Though the kernel $Q_\theta$ need not be uniquely ergodic for
$\theta \ne \theta^*$ in our main result, we will obtain easily verifiable
assumptions in a setting which implies that all $Q_\theta$ possess a unique
invariant probability measure. When this is the case, we will denote as
$\pi_\theta$ this invariant measure and we define
$\bar{\mathbb{P}}_\theta, \bar{\mathbb{E}}_\theta, \bar{\mathbb{P}}^Y_\theta$ as above.

Under the measure $\mathbb{P}^\nu_\theta$, the process $(X_k, Y_k)_{k \ge 0}$ is a
hidden Markov model. The hidden process $(X_k)_{k \ge 0}$ is a Markov chain in
its own right with initial measure $\nu$ and transition kernel $Q_\theta$, while
the observations $(Y_k)_{k \ge 0}$ are conditionally independent given the hidden
process with common observation kernel
$G_\theta(x, dy) = g_\theta(x, y)\, \mu(dy)$. In the setting of this paper, we
have access to a single observation path of the process $(Y_k)_{k \ge 0}$ sampled
from the distribution $\bar{\mathbb{P}}_{\theta^*}$. Thus, $\theta^*$ is
interpreted as the true parameter value, which is not known a priori. Our basic
problem is to obtain a consistent estimate of $\theta^*$ (up to equivalence,
i.e., we aim to identify the equivalence class $[\theta^*]$ of the true
parameter) on the basis of the observations $(Y_k)_{k \ge 0}$ only, without
access to the hidden process $(X_k)_{k \ge 0}$. This will be accomplished by the
maximum likelihood method.

Define for any positive $\sigma$-finite measure $\rho$ on $(\mathsf{X}, \mathcal{X})$

$$p^\rho(dx_{t+1}, y_s^t; \theta) := \int \rho(dx_s) \prod_{u=s}^{t}
  g_\theta(x_u, y_u)\, Q_\theta(x_u, dx_{u+1}),$$

$$p^\rho(y_s^t; \theta) := \int p^\rho(dx_{t+1}, y_s^t; \theta),$$

with the conventions $\prod_{u=v}^{w} a_u = 1$ if $v > w$ and, for any sequence
$(a_s)_{s \in \mathbb{Z}}$ and any integers $s \le t$,
$a_s^t := (a_s, \ldots, a_t)$. For ease of notation, we will write
$p^x(y_s^t; \theta) := p^{\delta_x}(y_s^t; \theta)$ for $x \in \mathsf{X}$, and we
write $\bar{p}(y_s^t; \theta) := p^{\pi_\theta}(y_s^t; \theta)$. Note that
$p^\rho(dx_{t+1}, y_s^t; \theta)$ is a positive but not necessarily
$\sigma$-finite measure. However, if $\rho$ is a probability measure, then
$p^\rho(dx_{t+1}, y_s^t; \theta)$ is a finite measure and
$p^\rho(y_s^t; \theta) < \infty$.

If $\nu$ is a probability measure, then $p^\nu(y_0^n; \theta)$ is the likelihood
of the observation sequence $y_0^n$ under the law $\mathbb{P}^\nu_\theta$. The
maximum likelihood method forms an estimate of $\theta^*$ by maximizing
$\theta \mapsto p^\nu(y_0^n; \theta)$, and we aim to establish consistency of
this estimator. However, as the state space $\mathsf{X}$ is not compact, it will
turn out to be essential to consider also $p^\lambda(y_0^n; \theta)$ for a
positive $\sigma$-finite measure $\lambda$.

We conclude this section with some miscellaneous notation. For any function
$f$, we denote as $|f|_\infty$ its supremum norm [e.g.,
$|g_\theta|_\infty := \sup_{(x, y) \in \mathsf{X} \times \mathsf{Y}} g_\theta(x, y)$].
As we will frequently integrate with respect to the measure $\mu$, we will use
the abridged notation $dy$ instead of $\mu(dy)$, and we write
$dy_s^t := \prod_{i=s}^{t} dy_i$. For any integer $m$ and $\theta \in \Theta$, we
denote by $Q_\theta^m$ the $m$th iterate of the kernel $Q_\theta$. For any pair
of probability measures $\mathbb{P}, \mathbb{Q}$ and function $V \ge 1$, we
define the norm

$$\|\mathbb{P} - \mathbb{Q}\|_V := \sup_{f : |f| \le V}
  \left| \int f\, d\mathbb{P} - \int f\, d\mathbb{Q} \right|.$$

Finally, the relative entropy (or Kullback-Leibler divergence) is defined as

$$\mathrm{KL}(\mathbb{P} \,\|\, \mathbb{Q}) :=
  \begin{cases}
    \int \log(d\mathbb{P}/d\mathbb{Q})\, d\mathbb{P}, & \text{if } \mathbb{P} \ll \mathbb{Q}, \\
    \infty, & \text{otherwise},
  \end{cases}$$

for any pair of probability measures $\mathbb{P}$ and $\mathbb{Q}$.
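
For distributions on a finite set, the relative entropy reduces to a finite sum, which the following minimal sketch computes. The function name `relative_entropy` and the representation of distributions as lists of probabilities are assumptions made for this example.

```python
import math


def relative_entropy(p, q):
    """KL(P || Q) on a finite set; returns inf unless P << Q."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # convention: 0 * log(0 / q) = 0
        if qi == 0.0:
            return math.inf  # P is not absolutely continuous w.r.t. Q
        total += pi * math.log(pi / qi)
    return total
```

For instance, `relative_entropy([1.0, 0.0], [0.5, 0.5])` equals `math.log(2)`, while swapping the arguments gives `math.inf` because absolute continuity fails.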

Remark 1. Throughout the paper, we will encounter partial suprema
of measurable functions [e.g., $y_0^n \mapsto \sup_{\theta \in U} p^\nu(y_0^n; \theta)$
for some measurable set $U \in \mathcal{H}$]. As the supremum is taken over an
uncountable set, such functions are not necessarily Borel-measurable. However,
as all our state spaces are Polish, such functions are always guaranteed to be
universally measurable ([4], Proposition 7.47). Similarly, a Borel-measurable
(approximate) maximum likelihood estimator need not exist, but the Polish
assumption ensures the existence of universally measurable maximum likelihood
estimators ([4], Proposition 7.50). All probabilities and expectations can
therefore be unambiguously extended to such quantities, which we will
implicitly assume to be the case in the sequel.

2.2. The consistency theorem. Our main result establishes consistency 
of the MLE under assumptions (A1)-(A6) below, which hold in a large class
of models. Various examples will be treated in Section 3 below. 

(A1) The Markov kernel $Q_{\theta^*}$ is positive Harris recurrent.

(A2) $\bar{\mathbb{E}}_{\theta^*}[\sup_{x \in \mathsf{X}} (\log g_{\theta^*}(x, Y_0))^+] < \infty$,
$\bar{\mathbb{E}}_{\theta^*}[|\log \int g_{\theta^*}(x, Y_0)\, \pi_{\theta^*}(dx)|] < \infty$.

Assumptions (A1), (A2) ensure the existence of the entropy rate for $\theta^*$.

(A3) There is an integer $l \ge 1$, a measurable function
$q : \Theta \times \mathsf{X} \times \mathsf{X} \to \mathbb{R}_+$, and a
$\sigma$-finite measure $\lambda$ on $(\mathsf{X}, \mathcal{X})$ such that
$|q_\theta|_\infty < \infty$ and

$$Q_\theta^l(x, A) = \int \mathbf{1}_A(x')\, q_\theta(x, x')\, \lambda(dx')$$

for all $\theta \not\sim \theta^*$, $x \in \mathsf{X}$, $A \in \mathcal{X}$.

Assumption (A3) states that an iterate of the transition kernel $Q_\theta$
possesses a density with respect to a $\sigma$-finite measure $\lambda$. This
property will allow us to establish the asymptotics of the likelihood of
$\mathbb{P}^\nu_\theta$ in terms of the improper likelihood
$p^\lambda(\cdot\,; \theta)$. The measure $\lambda$ plays a central role
throughout the paper.

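(A4) For every $\theta \not\sim \theta^*$, there is a neighborhood
$\mathcal{U}_\theta$ of $\theta$ such that

$$\sup_{\theta' \in \mathcal{U}_\theta} |q_{\theta'}|_\infty < \infty, \qquad
  \bar{\mathbb{E}}_{\theta^*}\left[ \sup_{\theta' \in \mathcal{U}_\theta}
  \sup_{x \in \mathsf{X}} (\log g_{\theta'}(x, Y_0))^+ \right] < \infty,$$

and there is an integer $r_\theta$ such that

$$\bar{\mathbb{E}}_{\theta^*}\left[ \sup_{\theta' \in \mathcal{U}_\theta}
  (\log p^\lambda(Y_0^{r_\theta}; \theta'))^+ \right] < \infty.$$

(A5) For any $\theta \not\sim \theta^*$ and $n \ge r_\theta$, the function
$\theta' \mapsto p^\lambda(Y_0^n; \theta')$ is upper-semicontinuous at $\theta$,
$\bar{\mathbb{P}}_{\theta^*}$-a.s.

Assumptions (A4) and (A5) are similar in spirit to the classical Wald
conditions in the case of i.i.d. observations. However, an important difference
with the classical case is that (A4) applies to
$p^\lambda(y_0^{r_\theta}; \theta)$, which is not a probability density (as
$\lambda$ is typically only $\sigma$-finite). Assumption (A4) implies in
particular that $p^\lambda(Y_0^{r_\theta}; \theta)$ is
$\bar{\mathbb{P}}_{\theta^*}$-a.s. finite. When $\lambda$ is $\sigma$-finite, this
requires, in essence, that the observations contain some information on the
range of values taken by the hidden process.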

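Finally, the key assumption (A6) below gives identifiability of the model.
In principle, what is needed is that $\mathbb{P}^{\lambda, Y}_\theta$ is
distinguishable from $\bar{\mathbb{P}}^Y_{\theta^*}$ in a suitable sense. However,
as $\lambda$ may be merely $\sigma$-finite, $\mathbb{P}^{\lambda, Y}_\theta$ is not
well defined. As a replacement, we will consider the probability measure
$\tilde{\mathbb{P}}^\lambda_\theta$ defined by

(1) $$\tilde{\mathbb{P}}^\lambda_\theta(Y_0^n \in A) = \int \mathbf{1}_A(y_0^n)\,
  \frac{p^\lambda(y_0^n; \theta)}{p^\lambda(y_0^{r_\theta}; \theta)}\,
  \bar{p}(y_0^{r_\theta}; \theta^*)\, dy_0^n$$

for all $n \ge r_\theta$ and $A \in \mathcal{Y}^{\otimes (n+1)}$ (note that the
definition of $\tilde{\mathbb{P}}^\lambda_\theta$ depends implicitly on $\theta^*$
as well as on $\theta$; the former dependence is suppressed for notational
simplicity). Lemma 11 shows that $\tilde{\mathbb{P}}^\lambda_\theta$ is well
defined, provided that (A4) holds and $p^\lambda(Y_0^{r_\theta}; \theta) > 0$
$\bar{\mathbb{P}}_{\theta^*}$-a.s. The law $\tilde{\mathbb{P}}^\lambda_\theta$ is in
essence a normalized version of $\mathbb{P}^{\lambda, Y}_\theta$, and (A6) should
be interpreted in this spirit.

(A6) For every $\theta \not\sim \theta^*$ such that
$p^\lambda(Y_0^{r_\theta}; \theta) > 0$ $\bar{\mathbb{P}}_{\theta^*}$-a.s., we have

$$\liminf_{n \to \infty} \bar{\mathbb{P}}_{\theta^*}(Y_0^n \in A_n) > 0, \qquad
  \limsup_{n \to \infty} n^{-1} \log \tilde{\mathbb{P}}^\lambda_\theta(Y_0^n \in A_n) < 0$$

for some sequence of sets $A_n \in \mathcal{Y}^{\otimes (n+1)}$.

Although this assumption looks nontrivial, we will obtain sufficient
conditions in Section 2.3 which are satisfied in a large class of models.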

Having introduced the necessary assumptions, we now turn to the statement
of our main result. Let $\ell_{\nu, n} : \theta \mapsto \log p^\nu(Y_0^n; \theta)$
be the log-likelihood function associated with the initial probability measure
$\nu$ and the observations $Y_0^n$. An approximate maximum likelihood estimator
$(\hat{\theta}_{\nu, n})_{n \ge 0}$ is defined as a sequence of (universally)
measurable functions $\hat{\theta}_{\nu, n}$ of $Y_0^n$ such that

$$n^{-1} \ell_{\nu, n}(\hat{\theta}_{\nu, n}) \ge
  \sup_{\theta \in \Theta} n^{-1} \ell_{\nu, n}(\theta) - o_{\mathrm{a.s.}}(1),$$

where $o_{\mathrm{a.s.}}(1)$ denotes a stochastic process that converges to zero
$\bar{\mathbb{P}}_{\theta^*}$-a.s. as $n \to \infty$ [if the supremum of
$\ell_{\nu, n}$ is attained, we may choose
$\hat{\theta}_{\nu, n} = \operatorname{argmax}_{\theta \in \Theta} \ell_{\nu, n}(\theta)$].
The main result of the paper consists in obtaining the consistency of
$\hat{\theta}_{\nu, n}$.
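
The definition above can be illustrated numerically with a minimal sketch, in which the supremum is taken over a finite grid rather than over all of $\Theta$. The one-parameter two-state family in `model`, the grid, the seed, and all function names are hypothetical choices made for this example.

```python
import math
import random


def log_likelihood(nu, Q, G, ys):
    """log p^nu(y_0^n; theta) for a finite-state HMM via the forward recursion."""
    d = len(nu)
    pred, total = list(nu), 0.0
    for y in ys:
        joint = [pred[i] * G[i][y] for i in range(d)]
        c = sum(joint)  # one-step predictive probability of y
        total += math.log(c)
        pred = [sum(joint[i] / c * Q[i][j] for i in range(d)) for j in range(d)]
    return total


def model(theta):
    """Hypothetical family: theta is the probability of staying in a state."""
    Q = [[theta, 1.0 - theta], [1.0 - theta, theta]]
    G = [[0.8, 0.2], [0.2, 0.8]]  # fixed emission probabilities
    return Q, G


def simulate(theta, n, rng):
    """Draw observations y_0, ..., y_n from the model with parameter theta."""
    Q, G = model(theta)
    x = rng.randrange(2)  # uniform start, which is stationary for this Q
    ys = []
    for _ in range(n + 1):
        ys.append(0 if rng.random() < G[x][0] else 1)
        x = 0 if rng.random() < Q[x][0] else 1
    return ys


def grid_mle(ys, grid, nu):
    """Return the grid point maximizing theta -> log p^nu(y_0^n; theta)."""
    return max(grid, key=lambda th: log_likelihood(nu, *model(th), ys))


grid = [0.05 * k for k in range(1, 20)]  # 0.05, 0.10, ..., 0.95
ys = simulate(theta=0.9, n=2000, rng=random.Random(1))
theta_hat = grid_mle(ys, grid, nu=[0.5, 0.5])
```

Restricting to a grid makes the estimator only an approximate maximizer; a finer grid or a numerical optimizer would be needed to drive the approximation error to zero as the definition requires.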

Theorem 1. Assume (A1)-(A6), and let $\nu$ be a fixed initial probability
measure. Suppose that one of the following assumptions holds:

1. $\nu \sim \pi_{\theta^*}$; or

2. $g_{\theta^*}(x, y) > 0$ for all $x, y$, and $Q_{\theta^*}$ is aperiodic; or

3. $g_{\theta^*}(x, y) > 0$ for all $x, y$, and $\nu$ has mass in each periodic
class of $Q_{\theta^*}$.

Then $\hat{\theta}_{\nu, n} \to [\theta^*]$ as $n \to \infty$,
$\bar{\mathbb{P}}_{\theta^*}$-a.s.

The proof of this theorem is given in Section 4.

Remark 2. The assumptions 1-3 in Theorem 1 impose different requirements
on the initial measure $\nu$ used for the maximum likelihood procedure.
When the true parameter is aperiodic and has nondegenerate observations,
consistency holds for any choice of $\nu$. On the other hand, in the case of
degenerate observations, it is evident that we cannot expect consistency to
hold in general without imposing an absolute continuity assumption of the
form $\nu \sim \pi_{\theta^*}$. The intermediate case, where the observations are
nondegenerate but the signal may be periodic, is not entirely obvious. An
illuminating counterexample, which shows that the MLE can be inconsistent for a
choice of $\nu$ that does not satisfy the requisite assumption in this case, is
given in Remark 12 below.

Remark 3. In Theorem 1, we have assumed that the data is generated
by the stationary measure $\bar{\mathbb{P}}_{\theta^*}$. However, it follows
directly from Lemma 7 below that, under the assumptions of Theorem 1, we also
have $\hat{\theta}_{\nu, n} \to [\theta^*]$ $\mathbb{P}^\rho_{\theta^*}$-a.s. for
any initial measure $\rho$ that satisfies the same assumptions as $\nu$ in
Theorem 1. Hence, the initial measure of the underlying chain is largely
irrelevant, both for the consistency of the estimator and in the definition of
the log-likelihood function $\ell_{\nu, n}(\theta)$ that is used to compute the
estimator.

Remark 4. The assumptions of Theorem 1 can be weakened somewhat. For
example, the $\sigma$-finite measure $\lambda$ can be allowed to depend on
$\theta$, or one may consider maximum likelihood estimates of the form
$\hat{\theta}_n = \operatorname{argmax}_{\theta \in \Theta} \ell_{\nu_\theta, n}(\theta)$
where the initial measure $\nu_\theta$ used to compute the likelihood depends on
$\theta$ (the latter does not affect the asymptotics of the MLE, but may improve
finite sample properties in certain cases). Such generalizations are
straightforward and require only minor adjustments in the proofs. In order not
to further complicate our notation, we leave these modifications to the reader.

Remark 5. As was pointed out to us by a referee, assumptions (A2) 
and (A4) depend on the choice of the observation reference measure $\mu$, even
though the maximum likelihood estimator itself is independent of the choice 
of reference measure. It is therefore possible that the assumptions of Theo- 
rem 1 are not satisfied for a given reference measure but that consistency 
of the MLE can be established nonetheless by making a suitable change of 
reference measure. 

2.3. Geometric ergodicity implies identifiability. Most of the assumptions
of Theorem 1 can be verified in a straightforward manner. The exception is the
identifiability assumption (A6), which appears to be nontrivial. Nonetheless,
we will show that this assumption holds in a large class of models: it is
already sufficient (besides a mild technical assumption) that the transition
kernel $Q_\theta$ is geometrically ergodic, a property that holds in many
applications. Moreover, there is a well-established theory of geometric
ergodicity for Markov chains [27] which provides a powerful set of tools to
verify this assumption. Consequently, our main theorem is directly applicable
in many cases of practical interest.

Remark 6. Before we state a precise result, it is illuminating to understand
the basic idea behind the proof of assumption (A6). Assume that $Q_\theta$ is
ergodic and that $\bar{\mathbb{P}}^Y_\theta \ne \bar{\mathbb{P}}^Y_{\theta^*}$. Then
there is an $s < \infty$ and a bounded function
$h : \mathsf{Y}^{s+1} \to \mathbb{R}$ such that
$\bar{\mathbb{E}}_\theta[h(Y_0^s)] = 0$ and
$\bar{\mathbb{E}}_{\theta^*}[h(Y_0^s)] = 1$. Define

$$A_n := \left\{ y_0^n : \frac{1}{n - s} \sum_{i=1}^{n-s} h(y_i^{i+s}) > \frac{1}{2} \right\}$$

for $n > s$. By the ergodic theorem, $\bar{\mathbb{P}}^Y_{\theta^*}(A_n) \to 1$ and
$\bar{\mathbb{P}}^Y_\theta(A_n) \to 0$ as $n \to \infty$. To prove (A6), one must
show that the convergence $\bar{\mathbb{P}}^Y_\theta(A_n) \to 0$ happens at an
exponential rate, that is, one must establish a type of large deviations
property. Therefore, the key thing to prove is that geometrically ergodic
Markov chains possess such a large deviations property. This will be done
in Section 5.

Let us begin by recalling the appropriate notion of geometric ergodicity
(the definition of the norm $\|\cdot\|_V$ was given in Section 2.1 above).

Definition 1. Let $V_\theta : \mathsf{X} \to [1, \infty)$ be given. The
transition kernel $Q_\theta$ is called $V_\theta$-uniformly ergodic if it
possesses an invariant probability measure $\pi_\theta$ and

$$\|Q_\theta^m(x, \cdot) - \pi_\theta\|_{V_\theta} \le
  R_\theta\, \alpha_\theta^{-m}\, V_\theta(x)
  \qquad \text{for every } x \in \mathsf{X},\ m \in \mathbb{N},$$

for some constants $R_\theta < \infty$ and $\alpha_\theta > 1$.
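
The geometric decay in Definition 1 can be observed directly on a two-state chain with $V \equiv 1$, where the $V$-norm is the sum of absolute differences of the probability vectors. This is a minimal sketch; the transition probabilities `a`, `b` and the helper names are assumptions made for the example. For this chain the distance to stationarity contracts by exactly $|1 - a - b|$ per step, so one may take $\alpha_\theta = 1/|1 - a - b|$.

```python
def v_norm_distance(p, q, V):
    """||p - q||_V on a finite space: the sup over |f| <= V is attained at
    f = V * sign(p - q), giving sum_i V[i] * |p[i] - q[i]|."""
    return sum(V[i] * abs(p[i] - q[i]) for i in range(len(p)))


def step(row, Q):
    """One step of the chain: the row vector `row` multiplied by Q."""
    d = len(row)
    return [sum(row[i] * Q[i][j] for i in range(d)) for j in range(d)]


a, b = 0.1, 0.2
Q = [[1 - a, a], [b, 1 - b]]
pi = [b / (a + b), a / (a + b)]  # invariant probability measure of Q
V = [1.0, 1.0]

row = [1.0, 0.0]  # start from the point mass at state 0
distances = []
for m in range(10):
    distances.append(v_norm_distance(row, pi, V))  # ||Q^m(0, .) - pi||_V
    row = step(row, Q)

ratios = [distances[m + 1] / distances[m] for m in range(9)]
```

Each entry of `ratios` is (up to rounding) `1 - a - b`, which is the geometric rate $\alpha_\theta^{-1}$ in the definition for this toy chain.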

For equivalent definitions and extensive discussion, see [27], Chapter 16. 
We can now formulate a practical sufficient condition for assumption (A6). 

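(A6') For every $\theta \not\sim \theta^*$ such that
$p^\lambda(Y_0^{r_\theta}; \theta) > 0$ $\bar{\mathbb{P}}_{\theta^*}$-a.s., there
exists a function $V_\theta \ge 1$ such that $Q_\theta$ is $V_\theta$-uniformly
ergodic, $\bar{\mathbb{P}}^Y_\theta \ne \bar{\mathbb{P}}^Y_{\theta^*}$, and

(2) $$\bar{\mathbb{P}}_{\theta^*}\left( \int V_\theta(x_{r_\theta + 1})\,
  p^\lambda(dx_{r_\theta + 1}, Y_0^{r_\theta}; \theta) < \infty \right) > 0.$$

Note, in particular, that (2) holds if (A4) holds and $|V_\theta|_\infty < \infty$
[in this case, (A6') implies that the transition kernel $Q_\theta$ is uniformly
ergodic]. In the setting where (A6') holds, it is most natural to consider the
equivalence relation $\sim$ defined by setting $\theta \sim \theta'$ if and only
if $\bar{\mathbb{P}}^Y_\theta = \bar{\mathbb{P}}^Y_{\theta'}$ (i.e., two parameters
are equivalent precisely when they give rise to the same stationary observation
laws).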

Theorem 2. Assume (A1), (A4) and (A6'). Then (A6) holds.

The proof of this theorem is given in Section 5.1.

A different sufficient condition for assumption (A6), which does not rely
on geometric ergodicity of the underlying model, is the following assumption
(A6''). We will use this assumption in Section 3.2 to show that when
$\mathsf{X}$ is a finite set, the identifiability assumption holds even for
nonergodic signals.

(A6'') For every $\theta \not\sim \theta^*$ and initial probability measure
$\nu$, we have

$$\liminf_{n \to \infty} \bar{\mathbb{P}}_{\theta^*}(Y_0^n \in A_n) > 0, \qquad
  \limsup_{n \to \infty} n^{-1} \log \mathbb{P}^\nu_\theta(Y_0^n \in A_n) < 0$$

for some sequence of sets $A_n \in \mathcal{Y}^{\otimes (n+1)}$.

Proposition 3. Assume (A4) and (A6''). Then (A6) holds.

The proof of this proposition is given in Section 5.1.

3. Examples. In this section, we develop three classes of examples. In 
Section 3.1 we consider linear Gaussian state space models. In Section 3.2, 
we consider the classic case where the signal state space is a finite set. Finally, 
in Section 3.3, we develop a general class of nonlinear state space models. 
In all these examples, we will find that the assumptions of Theorem 1 are 
satisfied in a rather general setting. 

3.1. Gaussian linear state space models. Gaussian linear state space models
form an important class of HMM. In this setting, let
$\mathsf{X} = \mathbb{R}^d$ and $\mathsf{Y} = \mathbb{R}^p$ for some integers
$d, p$, and let $\Theta$ be a compact parameter space. The transition kernel
$T_\theta$ of the model is specified by the state space dynamics

(3) $$X_{k+1} = A_\theta X_k + R_\theta U_k,$$

(4) $$Y_k = B_\theta X_k + S_\theta V_k,$$

where $\{(U_k, V_k)\}_{k \ge 0}$ is an i.i.d. sequence of Gaussian vectors with
zero mean and identity covariance matrix, independent of $X_0$. Here $U_k$ is
$q$-dimensional, $V_k$ is $p$-dimensional, and the matrices $A_\theta$,
$R_\theta$, $B_\theta$, $S_\theta$ have the appropriate dimensions.

For each $\theta \in \Theta$ and any integer $r \ge 1$, define

$$\mathcal{O}_{\theta, r} :=
  \begin{bmatrix}
    B_\theta \\ B_\theta A_\theta \\ B_\theta A_\theta^2 \\ \vdots \\ B_\theta A_\theta^{r-1}
  \end{bmatrix}
  \quad \text{and} \quad
  \mathcal{C}_{\theta, r} :=
  \begin{bmatrix} R_\theta & A_\theta R_\theta & \cdots & A_\theta^{r-1} R_\theta \end{bmatrix}.$$

It is assumed in the sequel that for any $\theta \in \Theta$, the following hold:

(L1) The pair $[A_\theta, B_\theta]$ is observable and the pair
$[A_\theta, R_\theta]$ is controllable, that is, the observability matrix
$\mathcal{O}_{\theta, d}$ and controllability matrix $\mathcal{C}_{\theta, d}$ are
full rank.

(L2) The state transition matrix $A_\theta$ is discrete-time Hurwitz, that is,
its eigenvalues all lie in the open unit disc in $\mathbb{C}$.

(L3) The measurement noise covariance matrix $S_\theta$ is full rank.

(L4) The functions $\theta \mapsto A_\theta$, $\theta \mapsto R_\theta$,
$\theta \mapsto B_\theta$ and $\theta \mapsto S_\theta$ are continuous on
$\Theta$.
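
The rank conditions in (L1) are easy to check mechanically. The following minimal sketch builds the observability and controllability matrices and computes their ranks by exact rational elimination; the helper names and the example matrices `A`, `R`, `B` (with $d = 2$, $q = p = 1$) are illustrative assumptions and not taken from the paper.

```python
from fractions import Fraction


def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def rank(M):
    """Rank by exact Gaussian elimination over the rationals."""
    M = [[Fraction(v) for v in row] for row in M]
    r = 0
    for c in range(len(M[0])):
        pivot = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, len(M)):
            f = M[i][c] / M[r][c]
            M[i] = [x - f * y for x, y in zip(M[i], M[r])]
        r += 1
        if r == len(M):
            break
    return r


def observability_matrix(A, B, r):
    """Stack B, BA, ..., BA^{r-1} vertically."""
    rows, block = [], B
    for _ in range(r):
        rows.extend(block)
        block = matmul(block, A)
    return rows


def controllability_matrix(A, R, r):
    """Place R, AR, ..., A^{r-1}R side by side."""
    blocks, block = [], R
    for _ in range(r):
        blocks.append(block)
        block = matmul(A, block)
    return [sum((blk[i] for blk in blocks), []) for i in range(len(R))]


half = Fraction(1, 2)
A = [[half, 1], [0, half]]  # triangular, eigenvalues 1/2 and 1/2, cf. (L2)
R = [[0], [1]]
B = [[1, 0]]
d = 2

obs_rank = rank(observability_matrix(A, B, d))
ctr_rank = rank(controllability_matrix(A, R, d))
```

Here both ranks equal `d`, so this hypothetical triple satisfies (L1). Replacing `R` by `[[1], [0]]` drops the controllability rank to 1, since the noise then never reaches the second state coordinate.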



We show below that the Markov kernel $Q_\theta$ is ergodic for every $\theta \in \Theta$. We can therefore define without ambiguity the equivalence relation $\sim$ on $\Theta$ as follows: $\theta \sim \theta'$ iff $\bar{\mathbb{P}}^Y_\theta = \bar{\mathbb{P}}^Y_{\theta'}$. We now proceed to verify the assumptions of Theorem 1.

The fact that $A_\theta$ is Hurwitz guarantees that the state equation is stable. Together with the controllability assumption, this implies that $Q_\theta$ is $V_\theta$-uniformly ergodic with $V_\theta(x) \asymp |x|^2$ as $|x| \to \infty$ ([15], pages 929 and 930). In particular, $Q_{\theta^*}$ is $V_{\theta^*}$-uniformly ergodic, which implies (A1).

By the assumption that $S_\theta$ is full rank, and choosing the reference measure $\mu$ to be the Lebesgue measure on $\mathsf{Y}$, we find that $g_\theta(x,y)$ is a Gaussian density for each $x \in \mathsf{X}$ with covariance matrix $S_\theta S_\theta^T$. We therefore have $|g_{\theta^*}|_\infty = (2\pi)^{-p/2} \det^{-1/2}(S_{\theta^*} S_{\theta^*}^T) < \infty$, so that $\bar{\mathbb{E}}_{\theta^*}[\sup_{x \in \mathsf{X}} (\log g_{\theta^*}(x, Y_0))^+] < \infty$. On the other hand, as the stationary distribution $\pi_{\theta^*}$ is Gaussian, the function $y \mapsto \int g_{\theta^*}(x,y)\,\pi_{\theta^*}(dx)$ is a Gaussian density with respect to $\mu$. Therefore, it is easily seen that $\bar{\mathbb{E}}_{\theta^*}[|\log \int g_{\theta^*}(x, Y_0)\,\pi_{\theta^*}(dx)|] < \infty$, and we have established (A2).

The dimension $q$ of the state noise vector $U_k$ is in many situations smaller than the dimension $d$ of the state vector, and hence $R_\theta R_\theta^T$ may be rank deficient. However, note that the $d$-step kernel $Q^d_\theta(x, dx')$ is a Gaussian distribution with covariance matrix $\mathcal{C}_{\theta,d} \mathcal{C}_{\theta,d}^T$ for each $x \in \mathsf{X}$. Therefore, the controllability of the pair $[A_\theta, R_\theta]$ nonetheless guarantees that $Q^d_\theta(x, dx')$ has a density with respect to the Lebesgue measure $\lambda$ on $\mathsf{X}$. Thus, (A3) is satisfied with $l = d$. To proceed, we obtain an explicit expression for $p^\lambda(y_0^{r-1}; \theta)$.

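Lemma 4. For $r \ge d$, we have

\[ p^\lambda(y_0^{r-1}; \theta) = (2\pi)^{(d - pr)/2} \det{}^{-1/2}(\mathcal{O}_{\theta,r}^T \Gamma_{\theta,r}^{-1} \mathcal{O}_{\theta,r}) \det{}^{-1/2}(\Gamma_{\theta,r}) \exp\bigl(-\tfrac{1}{2}\, \mathbf{y}_r^T H_{\theta,r}\, \mathbf{y}_r\bigr). \tag{5} \]

Here we defined the matrix $\Gamma_{\theta,r} = \mathcal{H}_{\theta,r} \mathcal{H}_{\theta,r}^T + \mathcal{S}_{\theta,r} \mathcal{S}_{\theta,r}^T$ with

\[ \mathcal{H}_{\theta,r} = \begin{pmatrix} 0 & 0 & \cdots & 0 \\ B_\theta R_\theta & 0 & \cdots & 0 \\ B_\theta A_\theta R_\theta & B_\theta R_\theta & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ B_\theta A_\theta^{r-2} R_\theta & B_\theta A_\theta^{r-3} R_\theta & \cdots & B_\theta R_\theta \end{pmatrix}, \]

and where $\mathcal{S}_{\theta,r}$ is the $pr \times pr$ block diagonal matrix with diagonal blocks equal to $S_\theta$, $\mathbf{y}_r = [y_0, \ldots, y_{r-1}]^T$, and $H_{\theta,r}$ is the matrix defined by

\[ H_{\theta,r} = \Gamma_{\theta,r}^{-1} - \Gamma_{\theta,r}^{-1} \mathcal{O}_{\theta,r} (\mathcal{O}_{\theta,r}^T \Gamma_{\theta,r}^{-1} \mathcal{O}_{\theta,r})^{-1} \mathcal{O}_{\theta,r}^T \Gamma_{\theta,r}^{-1}. \]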

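Proof. Define the vectors $\mathbf{Y}_r = [Y_0^T, \ldots, Y_{r-1}^T]^T$, $\mathbf{U}_{r-1} = [U_0^T, \ldots, U_{r-2}^T]^T$ and $\mathbf{V}_r = [V_0^T, \ldots, V_{r-1}^T]^T$. It follows from elementary algebra that

\[ \mathbf{Y}_r = \mathcal{O}_{\theta,r} X_0 + \mathcal{H}_{\theta,r} \mathbf{U}_{r-1} + \mathcal{S}_{\theta,r} \mathbf{V}_r \]

for any integer $r \ge 1$. Note that, as $\mathbf{U}_{r-1}$ and $\mathbf{V}_r$ are independent, the covariance matrix of the vector $\mathcal{H}_{\theta,r} \mathbf{U}_{r-1} + \mathcal{S}_{\theta,r} \mathbf{V}_r$ is given by $\Gamma_{\theta,r}$. It follows that

\[ p^x(y_0^{r-1}; \theta) = (2\pi)^{-pr/2} \det{}^{-1/2}(\Gamma_{\theta,r}) \exp\bigl(-\tfrac{1}{2} (\mathbf{y}_r - \mathcal{O}_{\theta,r} x)^T \Gamma_{\theta,r}^{-1} (\mathbf{y}_r - \mathcal{O}_{\theta,r} x)\bigr), \]

where we have used that $\Gamma_{\theta,r}$ is positive definite (this follows directly from the assumption that $S_\theta$ is full rank). Now let $\Pi_{\theta,r} \stackrel{\mathrm{def}}{=} \tilde{\mathcal{O}}_{\theta,r} (\tilde{\mathcal{O}}_{\theta,r}^T \tilde{\mathcal{O}}_{\theta,r})^{-1} \tilde{\mathcal{O}}_{\theta,r}^T$ be the orthogonal projector on the range of $\tilde{\mathcal{O}}_{\theta,r} = \Gamma_{\theta,r}^{-1/2} \mathcal{O}_{\theta,r}$ ($\Pi_{\theta,r}$ is well defined for $r \ge d$ as the pair $[A_\theta, B_\theta]$ is observable, so that $\tilde{\mathcal{O}}_{\theta,r}$ is full rank). Clearly,

\[ (\mathbf{y}_r - \mathcal{O}_{\theta,r} x)^T \Gamma_{\theta,r}^{-1} (\mathbf{y}_r - \mathcal{O}_{\theta,r} x) = \|\Pi_{\theta,r} \Gamma_{\theta,r}^{-1/2} \mathbf{y}_r - \Gamma_{\theta,r}^{-1/2} \mathcal{O}_{\theta,r} x\|^2 + \|(I - \Pi_{\theta,r}) \Gamma_{\theta,r}^{-1/2} \mathbf{y}_r\|^2. \]

The result now follows from

\[ \int \exp\bigl(-\tfrac{1}{2} \|\Pi_{\theta,r} \Gamma_{\theta,r}^{-1/2} \mathbf{y}_r - \Gamma_{\theta,r}^{-1/2} \mathcal{O}_{\theta,r} x\|^2\bigr)\, dx = (2\pi)^{d/2} \det{}^{-1/2}(\mathcal{O}_{\theta,r}^T \Gamma_{\theta,r}^{-1} \mathcal{O}_{\theta,r}) \]

(which is immediately seen to be finite due to the fact that $\mathcal{O}_{\theta,r}$ has full rank), and from the identity $H_{\theta,r} = \Gamma_{\theta,r}^{-1/2} (I - \Pi_{\theta,r}) \Gamma_{\theta,r}^{-1/2}$. □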

Remark 7. As is evident from the proof, the observability assumption is key in order to guarantee that $p^\lambda(y_0^{r-1}; \theta)$ is finite (albeit only for $r$ sufficiently large). Intuitively, observability guarantees that we can estimate $X_0$ from $Y_0^{r-1}$ "in every direction," so that the likelihood $p^x(y_0^{r-1}; \theta)$ becomes small as $|x| \to \infty$. This is needed in order to ensure that $p^x(y_0^{r-1}; \theta)$ is integrable with respect to the $\sigma$-finite measure $\lambda$. It should also be noted that for any $r \ge d$ the matrix $H_{\theta,r}$ is rank-deficient, showing that (5) is not the density of a finite measure.
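The closed form of Lemma 4 can be sanity-checked numerically by integrating $p^x(y_0^{r-1};\theta)$ over $x$ directly. The sketch below assumes NumPy and a scalar state ($d = 1$) so that a plain Riemann sum suffices; the function names, the parameter values and the grid are hypothetical choices made only for this illustration.

```python
import numpy as np

def lemma4_density(y, A, R, B, S, r):
    """Evaluate the closed form (5) for p^lambda(y_0^{r-1}; theta)."""
    d, p, q = A.shape[0], B.shape[0], R.shape[1]
    power = np.linalg.matrix_power
    # Observability matrix O_{theta,r}: rows B, BA, ..., BA^{r-1}.
    O = np.vstack([B @ power(A, k) for k in range(r)])
    # Block lower-triangular matrix mapping (U_0, ..., U_{r-2}) to the observations.
    H_noise = np.zeros((p * r, q * (r - 1)))
    for i in range(1, r):
        for j in range(i):
            H_noise[i*p:(i+1)*p, j*q:(j+1)*q] = B @ power(A, i - 1 - j) @ R
    S_blk = np.kron(np.eye(r), S)
    Gamma = H_noise @ H_noise.T + S_blk @ S_blk.T
    Gi = np.linalg.inv(Gamma)
    M = O.T @ Gi @ O
    H = Gi - Gi @ O @ np.linalg.inv(M) @ O.T @ Gi
    const = (2 * np.pi) ** ((d - p * r) / 2) / np.sqrt(np.linalg.det(M) * np.linalg.det(Gamma))
    return const * np.exp(-0.5 * y @ H @ y), O, Gamma

def brute_force_density(y, O, Gamma, half_width=20.0, n_grid=40001):
    """Riemann-sum approximation of the integral of p^x over x (scalar state only)."""
    xs = np.linspace(-half_width, half_width, n_grid)
    Gi = np.linalg.inv(Gamma)
    resid = y[None, :] - xs[:, None] * O[:, 0][None, :]
    quad = np.einsum("ij,jk,ik->i", resid, Gi, resid)
    dens = (2 * np.pi) ** (-len(y) / 2) / np.sqrt(np.linalg.det(Gamma)) * np.exp(-0.5 * quad)
    return dens.sum() * (xs[1] - xs[0])

# Hypothetical scalar model: d = p = q = 1, r = 2.
A = np.array([[0.8]]); R = np.array([[1.0]]); B = np.array([[1.0]]); S = np.array([[0.5]])
y = np.array([0.3, -0.4])
closed_form, O, Gamma = lemma4_density(y, A, R, B, S, r=2)
numeric = brute_force_density(y, O, Gamma)
```

The two values should agree up to quadrature error. For $d > 1$ the same comparison would need a multidimensional integration scheme, which is not attempted here.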

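Now note that, by our assumptions, the functions $\theta \mapsto \det^{-1/2}(\mathcal{O}_{\theta,r}^T \Gamma_{\theta,r}^{-1} \mathcal{O}_{\theta,r})$, $\theta \mapsto \det^{-1/2}(\Gamma_{\theta,r})$, and $\theta \mapsto H_{\theta,r}$ are continuous on $\Theta$ for any $r \ge d$. Thus, $\theta \mapsto p^\lambda(y_0^{r-1}; \theta)$ is continuous for every $r \ge d$, and it is easily established that $\bar{\mathbb{E}}_{\theta^*}[\sup_{\theta' \in \mathcal{U}_\theta} (\log p^\lambda(Y_0^{r_\theta}; \theta'))^+] < \infty$ if we choose $r_\theta = d - 1$ and a sufficiently small neighborhood $\mathcal{U}_\theta$. Moreover, note that $|g_\theta|_\infty = (2\pi)^{-p/2} \det^{-1/2}(S_\theta S_\theta^T)$ and $|q_\theta|_\infty = (2\pi)^{-d/2} \det^{-1/2}(\mathcal{C}_{\theta,d} \mathcal{C}_{\theta,d}^T)$. Therefore, by the continuity of $S_\theta$ and $\mathcal{C}_{\theta,d}$, we have $\sup_{\theta' \in \mathcal{U}_\theta} |q_{\theta'}|_\infty < \infty$ and $\bar{\mathbb{E}}_{\theta^*}[\sup_{\theta' \in \mathcal{U}_\theta} \sup_{x \in \mathsf{X}} (\log g_{\theta'}(x, Y_0))^+] < \infty$ for a sufficiently small neighborhood $\mathcal{U}_\theta$. Thus, we have verified (A4) and (A5).

It remains to establish assumption (A6). We established above that $Q_\theta$ is $V_\theta$-uniformly ergodic with $V_\theta(x) \asymp |x|^2$ as $|x| \to \infty$. Moreover, $\theta \not\sim \theta^*$ implies $\bar{\mathbb{P}}^Y_\theta \ne \bar{\mathbb{P}}^Y_{\theta^*}$ by definition. Therefore, (A6') would be established if

\[ \int |x_d|^2\, p^\lambda(dx_d, Y_0^{d-1}; \theta) < \infty, \qquad \bar{\mathbb{P}}_{\theta^*}\text{-a.s.} \]

But note that

\[ \int p^\lambda(dx_d, Y_0^{d-1}; \theta) = p^\lambda(Y_0^{d-1}; \theta) < \infty, \qquad \bar{\mathbb{P}}_{\theta^*}\text{-a.s.}, \]

so that $p^\lambda(dx_d, Y_0^{d-1}; \theta)$ is a finite measure. Moreover, as $(Y_0^{d-1}, X_d) = M X_0 + \xi$ for a matrix $M$ and a Gaussian vector $\xi$, it is easily seen that $p^\lambda(dx_d, Y_0^{d-1}; \theta)$ must be a random Gaussian measure. As Gaussian measures have finite moments, we have established (A6'). Therefore, (A6) follows from Theorem 2.

Having verified (A1)-(A6), we can apply Theorem 1. As $g_{\theta^*}(x,y) > 0$ for all $x, y$, and as $Q_{\theta^*}$ is $V_{\theta^*}$-uniformly ergodic (hence certainly aperiodic), we find that the MLE is consistent for any initial measure $\nu$.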

3.2. Finite state models. One of the most widely used classes of HMM is obtained when the signal is a finite state Markov chain. In this setting, let $\mathsf{X} = \{1, \ldots, d\}$ for some integer $d$, let $\mathsf{Y}$ be any Polish space, and let $\Theta$ be a compact metric space. For each parameter $\theta \in \Theta$, the signal transition kernel $Q_\theta$ is determined by the corresponding transition probability matrix $\mathbf{Q}_\theta$, while the observation density $g_\theta$ is given as in the general setting of this paper.
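In the finite state setting the likelihood $p^\nu(y_0^n;\theta)$ is a finite sum over signal paths and can be evaluated with the usual normalized forward recursion. The following is a minimal sketch, assuming NumPy; the function names are hypothetical, and the observation densities are supplied as a precomputed array `G[k, x]` standing for $g_\theta(x, y_k)$ (random positive numbers are used in place of a concrete observation model). A brute-force sum over all paths is included only as a cross-check for small examples.

```python
import itertools
import numpy as np

def log_likelihood(nu, Q, G):
    """Normalized forward recursion for log p^nu(y_0^n; theta).

    nu: (d,) initial law; Q: (d, d) transition matrix; G[k, x] = g_theta(x, y_k).
    """
    alpha = nu * G[0]
    ll = 0.0
    for k in range(len(G)):
        if k > 0:
            alpha = (alpha @ Q) * G[k]   # predict one step, then weight by the density
        c = alpha.sum()
        ll += np.log(c)                  # accumulate the log normalizing constants
        alpha = alpha / c                # renormalize to avoid underflow
    return ll

def brute_force_log_likelihood(nu, Q, G):
    """Sum the joint density over all d^(n+1) signal paths (small cases only)."""
    n, d = G.shape
    total = 0.0
    for path in itertools.product(range(d), repeat=n):
        w = nu[path[0]] * G[0, path[0]]
        for k in range(1, n):
            w *= Q[path[k - 1], path[k]] * G[k, path[k]]
        total += w
    return np.log(total)

# Hypothetical example: three states, five observations.
rng = np.random.default_rng(0)
d, n = 3, 5
Q = rng.random((d, d))
Q /= Q.sum(axis=1, keepdims=True)
nu = np.full(d, 1.0 / d)
G = rng.uniform(0.1, 2.0, size=(n, d))
ll_forward = log_likelihood(nu, Q, G)
ll_brute = brute_force_log_likelihood(nu, Q, G)
```

Replacing `nu` by a vector of ones gives the corresponding quantity for the counting measure $\lambda$ as initial measure.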

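It is assumed in the sequel that:

(F1) The stochastic matrix $\mathbf{Q}_{\theta^*}$ is irreducible.

(F2) $\bar{\mathbb{E}}_{\theta^*}[|\log g_{\theta^*}(x, Y_0)|] < \infty$ for every $x \in \mathsf{X}$.

(F3) For every $\theta \in \Theta$, there is a neighborhood $\mathcal{U}_\theta$ of $\theta$ such that

\[ \bar{\mathbb{E}}_{\theta^*}\Bigl[\sup_{\theta' \in \mathcal{U}_\theta} (\log g_{\theta'}(x, Y_0))^+\Bigr] < \infty \quad \text{for all } x \in \mathsf{X}. \]

(F4) $\theta \mapsto \mathbf{Q}_\theta$ and $\theta \mapsto g_\theta(x, y)$ are continuous for any $x \in \mathsf{X}$, $y \in \mathsf{Y}$.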

Following [23], we introduce the equivalence relation on $\Theta$ as follows: we write $\theta \sim \theta'$ iff there exist invariant distributions $\pi, \pi'$ for $Q_\theta, Q_{\theta'}$, respectively, such that $\mathbb{P}^{\pi,Y}_\theta = \mathbb{P}^{\pi',Y}_{\theta'}$. In words, two parameters are equivalent whenever they give rise to the same stationary observation laws for some choice of invariant measures for the underlying signal process. The latter statement is not vacuous as we have not required that $Q_\theta$ is ergodic for $\theta \ne \theta^*$, that is, there may be multiple invariant measures for $Q_\theta$. The possibility that $Q_\theta$ is not aperiodic or even ergodic is the chief complication in this example, as the easily verified $V$-uniform ergodicity assumption (A6') need not hold. We will show nonetheless that assumption (A6") is satisfied, so that Theorem 1 can be applied.



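Lemma 5. Let $C \subseteq \mathsf{X}$ be an ergodic class of $Q_\theta$, and denote by $\pi_C$ the unique $Q_\theta$-invariant measure supported in $C$. Fix $s \ge 0$, and let $f : \mathsf{Y}^{s+1} \to \mathbb{R}$ be such that $|f|_\infty < \infty$. Then there exists a constant $K$ such that

\[ \mathbb{P}^\nu_\theta\Bigl( \Bigl| \sum_{i=1}^{n} \bigl\{ f(Y_i^{i+s}) - \mathbb{E}^{\pi_C}_\theta[f(Y_0^s)] \bigr\} \Bigr| \ge t \Bigr) \le K \exp\Bigl( -\frac{t^2}{K n} \Bigr) \]

for any probability measure $\nu$ supported in $C$ and any $t > 0$, $n \ge 1$.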



Proof. The proof is identical to that of Theorem 14, provided we re- 
place the application of Theorem 17 by a trivial modification of the result 
of [16]. □ 

Remark 8. As stated, the result of [16] would require that the restric- 
tion of Qg to C is aperiodic. However, aperiodicity is only used in the proof 
to ensure the existence of a solution to the Poisson equation, and it is well 
known that the latter holds also in the periodic case. Therefore, a trivial 
modification of the proof in [16] allows us to apply the result without addi- 
tional assumptions. 

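Lemma 6. In the present setting, assumption (A6") holds.

Proof. Let $\theta \not\sim \theta^*$. We can partition $\mathsf{X} = E_1 \cup \cdots \cup E_p \cup T$ into the $p \le d$ ergodic classes $E_1, \ldots, E_p$ and the set of transient states $T$ of the stochastic matrix $\mathbf{Q}_\theta$. Denote as $\pi^i_\theta$ the unique invariant measure of $Q_\theta$ that is supported in $E_i$. Then we can find an integer $s \ge 1$ and a bounded function $h : \mathsf{Y}^{s+1} \to \mathbb{R}$ such that $\mathbb{E}^{\pi^i_\theta}_\theta[h(Y_0^s)] \le 0$ for all $i = 1, \ldots, p$ and such that $\bar{\mathbb{E}}_{\theta^*}[h(Y_0^s)] = 1$.

Define for $n \ge 2s$ the set $A_n \in \mathcal{Y}^{\otimes(n+1)}$ as

\[ A_n = \Bigl\{ y_0^n \in \mathsf{Y}^{n+1} : \frac{1}{\lceil n/2 \rceil - s} \sum_{i = \lfloor n/2 \rfloor + 1}^{n - s} h(y_i^{i+s}) \ge \frac{1}{2} \Bigr\}. \]

As $Y_0^\infty$ is stationary and ergodic under $\bar{\mathbb{P}}_{\theta^*}$ (because $\mathbf{Q}_{\theta^*}$ is irreducible), we have

\[ \lim_{n \to \infty} \bar{\mathbb{P}}_{\theta^*}(Y_0^n \in A_n) = \lim_{n \to \infty} \bar{\mathbb{P}}_{\theta^*}\Bigl[ \frac{1}{\lceil n/2 \rceil - s} \sum_{i=1}^{\lceil n/2 \rceil - s} h(Y_i^{i+s}) \ge \frac{1}{2} \Bigr] = 1 \]

by Birkhoff's ergodic theorem. On the other hand, for any initial probability measure $\nu$, we can estimate as follows: for some constant $K > 0$,

\[ \begin{aligned} \mathbb{P}^\nu_\theta(Y_0^n \in A_n) &= \mathbb{P}^\nu_\theta(Y_0^n \in A_n,\ X_{\lfloor n/2 \rfloor} \in T) + \sum_{j=1}^{p} \mathbb{P}^\nu_\theta(Y_0^n \in A_n \mid X_{\lfloor n/2 \rfloor} \in E_j)\, \mathbb{P}^\nu_\theta(X_{\lfloor n/2 \rfloor} \in E_j) \\ &\le \mathbb{P}^\nu_\theta(X_{\lfloor n/2 \rfloor} \in T) + \max_{j = 1, \ldots, p}\ \sup_{\operatorname{supp} \mu \subseteq E_j} \mathbb{P}^\mu_\theta\Bigl[ \frac{1}{\lceil n/2 \rceil - s} \sum_{i=1}^{\lceil n/2 \rceil - s} h(Y_i^{i+s}) \ge \frac{1}{2} \Bigr] \\ &\le K \exp\Bigl( -\frac{n}{K} \Bigr). \end{aligned} \]

The latter inequality follows from the fact that the population in the transient states decays exponentially, while we may apply Lemma 5 to obtain an exponential bound for every ergodic class $E_j$. We therefore find that

\[ \limsup_{n \to \infty} n^{-1} \log \mathbb{P}^\nu_\theta(Y_0^n \in A_n) \le -\frac{1}{K} < 0, \]

completing the proof of assumption (A6"). □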

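Let us now check the assumptions of Theorem 1. (A1) follows directly from the assumption that $\mathbf{Q}_{\theta^*}$ is irreducible. To establish (A2), note that

\[ \bar{\mathbb{E}}_{\theta^*}\Bigl[ \sup_{x \in \mathsf{X}} (\log g_{\theta^*}(x, Y_0))^+ \Bigr] \le \sum_{x \in \mathsf{X}} \bar{\mathbb{E}}_{\theta^*}[|\log g_{\theta^*}(x, Y_0)|] < \infty, \]

while we can estimate

\[ \begin{aligned} \bar{\mathbb{E}}_{\theta^*}\Bigl[ \Bigl| \log \int g_{\theta^*}(x, Y_0)\, \pi_{\theta^*}(dx) \Bigr| \Bigr] &\le \bar{\mathbb{E}}_{\theta^*}\Bigl[ \sup_{x \in \mathsf{X}} (\log g_{\theta^*}(x, Y_0))^+ \Bigr] + \bar{\mathbb{E}}_{\theta^*}\Bigl[ \sup_{x \in \mathsf{X}} (\log g_{\theta^*}(x, Y_0))^- \Bigr] \\ &\le \sum_{x \in \mathsf{X}} \bar{\mathbb{E}}_{\theta^*}[|\log g_{\theta^*}(x, Y_0)|] < \infty. \end{aligned} \]

Assumption (A3) holds trivially for $l = 1$ and with $\lambda$ the counting measure on $\mathsf{X}$ [note that $|q_\theta|_\infty \le 1$ for all $\theta$, as $q_\theta(x, x')$ is simply the transition probability from $x$ to $x'$]. To establish (A4), note that $\sup_{\theta \in \Theta} |q_\theta|_\infty < \infty$, while

\[ \bar{\mathbb{E}}_{\theta^*}\Bigl[ \sup_{\theta' \in \mathcal{U}_\theta} \sup_{x \in \mathsf{X}} (\log g_{\theta'}(x, Y_0))^+ \Bigr] \le \sum_{x \in \mathsf{X}} \bar{\mathbb{E}}_{\theta^*}\Bigl[ \sup_{\theta' \in \mathcal{U}_\theta} (\log g_{\theta'}(x, Y_0))^+ \Bigr] < \infty \]

by our assumptions. Moreover, as

\[ \bar{\mathbb{E}}_{\theta^*}\Bigl[ \sup_{\theta' \in \mathcal{U}_\theta} (\log p^\lambda(Y_0; \theta'))^+ \Bigr] \le \bar{\mathbb{E}}_{\theta^*}\Bigl[ \sup_{\theta' \in \mathcal{U}_\theta} \sup_{x \in \mathsf{X}} (\log d + \log g_{\theta'}(x, Y_0))^+ \Bigr] < \infty, \]

we have shown that (A4) holds with $r_\theta = 0$ for all $\theta$. Next, we note that the continuity of $\theta \mapsto \mathbf{Q}_\theta$ and $\theta \mapsto g_\theta(x, y)$ yields immediately that $\theta \mapsto p^\lambda(y_0^n; \theta)$ is a continuous function for every $n \ge 0$ and $y_0^n \in \mathsf{Y}^{n+1}$, establishing (A5). Finally, Lemma 6 and Proposition 3 establish (A6).

Having verified (A1)-(A6), we can apply Theorem 1. Note that as $\mathbf{Q}_{\theta^*}$ is irreducible, $\pi_{\theta^*}$ charges every point of $\mathsf{X}$. Therefore, by Theorem 1, the MLE is consistent provided that $\nu$ charges every point of $\mathsf{X}$ (so that $\nu \sim \pi_{\theta^*}$).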



Remark 9. The result obtained in this section as a special case of The- 
orem 1 is almost identical to the result of Leroux [23] . The main difference 
in [23] is that there the parameter space may be noncompact, provided 
the parametrization of the model vanishes at infinity. This setting reduces 
directly to the compact case by compactifying the parameter space $\Theta$, so
that this does not constitute a major generalization from the technical point 
of view. 

However, it should be noted that one cannot immediately apply Theorem 
1 to the compactified model. The problem is that the new parameters "at in- 
finity" are typically sub-probabilities rather than true probability measures, 
while we have assumed in this paper that every parameter $\theta \in \Theta$ corresponds
to a probability measure on the space of observation paths. Theorem 1 can 
certainly be generalized to allow for sub-probabilities without significant 
technical complications. We have chosen to concentrate on the compact set- 
ting, however, in order to keep the notation and results of the paper as clean 
as possible. 

3.3. Nonlinear state space models. In this section, we consider a class of nonlinear state space models. Let $\mathsf{X} = \mathbb{R}^d$, $\mathsf{Y} = \mathbb{R}^\ell$, and let $\Theta$ be a compact metric space. For each $\theta \in \Theta$, the Markov kernel $Q_\theta$ of the hidden process $(X_k)_{k \ge 0}$ is defined through the nonlinear recursion

\[ X_k = G_\theta(X_{k-1}) + \Sigma_\theta(X_{k-1})\, \zeta_k. \]

Here $(\zeta_k)_{k \ge 1}$ is an i.i.d. sequence of $d$-dimensional random vectors which are assumed to possess a density $p_\zeta$ with respect to the Lebesgue measure $\lambda$ on $\mathbb{R}^d$, and $G_\theta : \mathbb{R}^d \to \mathbb{R}^d$, $\Sigma_\theta : \mathbb{R}^d \to \mathbb{R}^{d \times d}$ are given (measurable) functions. The model for the hidden chain $(X_k)_{k \ge 0}$ is sometimes known as a vector ARCH model, and covers many models of interest in time series analysis and financial econometrics (including the AR model, the ARCH model, threshold ARCH, etc.). We let the reference measure $\mu$ be the Lebesgue measure on $\mathbb{R}^\ell$, and define the observed process $(Y_k)_{k \ge 0}$ by means of a given observation density $g_\theta(x, y)$.

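For any positive matrix $B$, denote by $\lambda_{\min}(B)$ its minimal eigenvalue. For any bounded set $\mathcal{A} \subset \mathbb{R}^{d \times d}$, define $\mathcal{A}^m \stackrel{\mathrm{def}}{=} \{A_1 A_2 \cdots A_m : A_i \in \mathcal{A},\ i = 1, \ldots, m\}$. Denote by $\rho(\mathcal{A})$ the joint spectral radius of the set of matrices $\mathcal{A}$, defined as

\[ \rho(\mathcal{A}) \stackrel{\mathrm{def}}{=} \limsup_{m \to \infty} \Bigl( \sup_{A \in \mathcal{A}^m} \|A\| \Bigr)^{1/m}. \]

Here $\|\cdot\|$ is any matrix norm [it is elementary that $\rho(\mathcal{A})$ does not depend on the choice of the norm]. We now introduce the basic assumptions of this section.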

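(NL1) The random variables $\zeta_k$ have mean zero and identity covariance matrix. Moreover, $p_\zeta(x) > 0$ for all $x \in \mathbb{R}^d$, and $|p_\zeta|_\infty < \infty$.

(NL2) For each $\theta \in \Theta$, the function $\Sigma_\theta$ is bounded on compact sets, $\Sigma_\theta(x) = o(|x|)$ as $|x| \to \infty$, and $0 < \inf_{\theta' \in \mathcal{U}_\theta} \inf_{x \in \mathbb{R}^d} \lambda_{\min}[\Sigma_{\theta'}(x) \Sigma_{\theta'}^T(x)]$ for a sufficiently small neighborhood $\mathcal{U}_\theta$ of $\theta$.

(NL3) For each $\theta \in \Theta$, the drift function $G_\theta$ has the form

\[ G_\theta(x) = A_\theta(x)\, x + h_\theta(x) \]

for some measurable functions $A_\theta : \mathbb{R}^d \to \mathbb{R}^{d \times d}$ and $h_\theta : \mathbb{R}^d \to \mathbb{R}^d$. Moreover, we assume that $G_\theta$ is bounded on compact sets, $h_\theta(x) = o(|x|)$ as $|x| \to \infty$, and that there exists $R_\theta > 0$ such that the set of matrices $\mathcal{A}_\theta = \{A_\theta(x) : x \in \mathbb{R}^d,\ |x| \ge R_\theta\}$ is bounded and $\rho(\mathcal{A}_\theta) < 1$.

(NL4) For each $\theta \in \Theta$, there is a neighborhood $\mathcal{U}_\theta$ of $\theta$ such that

\[ \bar{\mathbb{E}}_{\theta^*}\Bigl[ \sup_{\theta' \in \mathcal{U}_\theta} \sup_{x \in \mathsf{X}} (\log g_{\theta'}(x, Y_0))^+ \Bigr] < \infty, \qquad \bar{\mathbb{E}}_{\theta^*}\Bigl[ \sup_{\theta' \in \mathcal{U}_\theta} \Bigl( \log \int g_{\theta'}(x, Y_0)\, \lambda(dx) \Bigr)^+ \Bigr] < \infty. \]

Moreover, $\bar{\mathbb{P}}_{\theta^*}\bigl( \int |x|\, g_\theta(x, Y_0)\, \lambda(dx) < \infty \bigr) > 0$ for each $\theta \in \Theta$, and

\[ \bar{\mathbb{E}}_{\theta^*}\Bigl[ \int (\log g_{\theta^*}(x, Y_0))^-\, \pi_{\theta^*}(dx) \Bigr] < \infty. \]

(NL5) The functions $\theta \mapsto g_\theta(x, y)$, $\theta \mapsto G_\theta(x)$, $\theta \mapsto \Sigma_\theta(x)$ and $x \mapsto p_\zeta(x)$ are continuous on $\Theta$ for every $x, y$. Moreover, for each $\theta \in \Theta$, the function $\theta' \mapsto \int g_{\theta'}(x, Y_0)\, \lambda(dx)$ is positive and continuous at $\theta$, $\bar{\mathbb{P}}_{\theta^*}$-a.s.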
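Condition (NL3) asks for a joint spectral radius below one. When the set of matrices is finite (for instance in a threshold model with finitely many regimes, which is an assumption of this illustration and not a requirement of the paper), products of a fixed length $m$ give two-sided bounds: the largest spectral radius of a product, raised to the power $1/m$, is a lower bound, and the largest operator norm of a product, raised to the power $1/m$, is an upper bound. The sketch below assumes NumPy; the function name and the two example matrices are hypothetical.

```python
import itertools
from functools import reduce
import numpy as np

def jsr_bounds(mats, m):
    """Brute-force lower/upper bounds on the joint spectral radius of a finite set.

    Enumerates all len(mats)**m products of length m, so it is only practical
    for small sets and small m.
    """
    lower, upper = 0.0, 0.0
    for combo in itertools.product(mats, repeat=m):
        P = reduce(np.matmul, combo)
        # Spectral radius of the product gives a lower bound.
        lower = max(lower, np.max(np.abs(np.linalg.eigvals(P))) ** (1.0 / m))
        # Spectral (operator 2-) norm of the product gives an upper bound.
        upper = max(upper, np.linalg.norm(P, 2) ** (1.0 / m))
    return lower, upper

# Hypothetical two-regime example.
mats = [np.array([[0.5, 0.4], [0.0, 0.3]]),
        np.array([[0.2, 0.0], [0.6, 0.4]])]
lower, upper = jsr_bounds(mats, m=4)
```

If the upper bound is below one for some $m$, the spectral radius condition in (NL3) holds for that finite set. An upper bound above one is inconclusive, and a larger $m$ may be needed.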



Remark 10. We have made no attempt at generality here: for the sake of example, we have chosen a set of conditions under which the assumptions of Theorem 1 are easily verified. Of course, the applicability of Theorem 1 extends far beyond the simple assumptions imposed in this section.

Nonetheless, even the present assumptions already cover a broad class of nonlinear models. Consider, for example, the stochastic volatility model [17]

\[ \begin{cases} X_{k+1} = \phi_\theta X_k + \sigma_\theta \zeta_k, \\ Y_k = \beta_\theta \exp(X_k/2)\, \varepsilon_k, \end{cases} \tag{6} \]

where $(\zeta_k, \varepsilon_k)$ are i.i.d. Gaussian random variables in $\mathbb{R}^2$ with zero mean and identity covariance matrix, $\beta_\theta > 0$, $\sigma_\theta > 0$ and $|\phi_\theta| < 1$ for every $\theta \in \Theta$, and the functions $\theta \mapsto \phi_\theta$, $\theta \mapsto \sigma_\theta$ and $\theta \mapsto \beta_\theta$ are continuous. Assumptions (NL1)-(NL3) are readily seen to hold. The observation likelihood $g_\theta$ is given by

\[ g_\theta(x, y) = (2\pi\beta_\theta^2)^{-1/2} \exp\bigl[ -\exp(-x)\, y^2 / 2\beta_\theta^2 - x/2 \bigr]. \]

We can compute

\[ \sup_{x \in \mathsf{X}} g_\theta(x, y) = \frac{1}{\sqrt{2\pi e}\, |y|}, \qquad \int g_\theta(x, y)\, \lambda(dx) = \frac{1}{|y|}. \]

As the stationary distribution $\pi_{\theta^*}$ is Gaussian, it is easily seen that the law of $Y_0$ under $\bar{\mathbb{P}}_{\theta^*}$ has a bounded density with respect to the Lebesgue measure $\mu$ on $\mathsf{Y}$. As $\int (\log(1/|y|))^+\, \mu(dy) < \infty$, the first equation display of (NL4) follows. To prove that $\bar{\mathbb{P}}_{\theta^*}\bigl( \int |x|\, g_\theta(x, Y_0)\, \lambda(dx) < \infty \bigr) > 0$, it suffices to note that $x \mapsto g_\theta(x, y)$ has exponentially decaying tails for all $|y| > 0$. The remaining part of (NL4) follows easily using that $\pi_{\theta^*}$ is Gaussian and $\bar{\mathbb{E}}_{\theta^*}(Y_0^2) < \infty$. Finally, (NL5) now follows immediately, and we have verified that the assumptions of this section hold for the stochastic volatility model. Similar considerations apply in a variety of nonlinear models commonly used in financial econometrics.
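The two closed-form identities for the stochastic volatility likelihood (the supremum over $x$ and the integral over $x$) can be confirmed numerically on a fine grid. The sketch below assumes NumPy; the values of $\beta_\theta$ and $y$ and the grid limits are arbitrary choices made only for this check.

```python
import numpy as np

def g(x, y, beta):
    """Observation density of the stochastic volatility model, as a function of x."""
    return (2 * np.pi * beta**2) ** -0.5 * np.exp(-np.exp(-x) * y**2 / (2 * beta**2) - x / 2)

beta, y = 0.7, 1.3                       # arbitrary illustrative values
xs = np.linspace(-40.0, 40.0, 800001)    # grid with step 1e-4
vals = g(xs, y, beta)

sup_numeric = vals.max()
int_numeric = vals.sum() * (xs[1] - xs[0])   # Riemann sum over the grid

sup_closed = 1.0 / (np.sqrt(2 * np.pi * np.e) * abs(y))
int_closed = 1.0 / abs(y)
```

Both closed forms are independent of $\beta_\theta$, which is why the first display of (NL4) reduces to an integrability condition on $\log(1/|Y_0|)$ alone.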

We show below that the Markov kernel $Q_\theta$ is ergodic for every $\theta \in \Theta$. We can therefore define without ambiguity the equivalence relation $\sim$ on $\Theta$ as follows: $\theta \sim \theta'$ iff $\bar{\mathbb{P}}^Y_\theta = \bar{\mathbb{P}}^Y_{\theta'}$. We now proceed to verify the assumptions of Theorem 1.

It is shown in [24], Theorem 2, that under conditions (NL1)-(NL3), the Markov kernel $Q_\theta$ is $V$-uniformly ergodic for each $\theta \in \Theta$ with $V(x) = 1 + |x|$. In particular, assumption (A1) holds. The first part of (A2) follows directly from (NL4). To prove the second part, we first note that $Q_\theta$ has a transition density

\[ q_\theta(x, x') = |\det[\Sigma_\theta(x)]|^{-1}\, p_\zeta\bigl( \Sigma_\theta^{-1}(x) \{x' - G_\theta(x)\} \bigr) \]

with respect to the Lebesgue measure $\lambda$ on $\mathsf{X}$. This evidently gives

\[ |q_\theta|_\infty = \sup_{x \in \mathsf{X}} |\det[\Sigma_\theta(x)]|^{-1}\, |p_\zeta|_\infty < \infty \]





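by (NL1) and (NL2), which implies in particular that $\pi_{\theta^*}$ has a bounded density with respect to $\lambda$. Therefore

\[ \bar{\mathbb{E}}_{\theta^*}\Bigl[ \Bigl( \log \int g_{\theta^*}(x, Y_0)\, \pi_{\theta^*}(dx) \Bigr)^+ \Bigr] \le (\log |q_{\theta^*}|_\infty)^+ + \bar{\mathbb{E}}_{\theta^*}\Bigl[ \Bigl( \log \int g_{\theta^*}(x, Y_0)\, \lambda(dx) \Bigr)^+ \Bigr] < \infty \]

by (NL4). On the other hand, as $x \mapsto (\log x)^-$ is convex, we have

\[ \bar{\mathbb{E}}_{\theta^*}\Bigl[ \Bigl( \log \int g_{\theta^*}(x, Y_0)\, \pi_{\theta^*}(dx) \Bigr)^- \Bigr] \le \bar{\mathbb{E}}_{\theta^*}\Bigl[ \int (\log g_{\theta^*}(x, Y_0))^-\, \pi_{\theta^*}(dx) \Bigr] < \infty \]

by Jensen's inequality and (NL4). Therefore, (A2) is established. We have already shown that $Q_\theta$ possesses a bounded density, so (A3) holds with $l = 1$. Assumption (A4) with $r_\theta = 0$ follows directly from (NL4) and (NL1), (NL2).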

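To establish (A5), let $\nu_\theta(dx, y) = g_\theta(x, y)\, \lambda(dx) / \int g_\theta(x, y)\, \lambda(dx)$. By (NL5), $\nu_\theta(dx, Y_0)$ is a probability measure $\bar{\mathbb{P}}_{\theta^*}$-a.s., and for every $\theta \in \Theta$ the density function $\theta' \mapsto g_{\theta'}(x, Y_0) / \int g_{\theta'}(x, Y_0)\, \lambda(dx)$ is continuous at $\theta$ $\bar{\mathbb{P}}_{\theta^*}$-a.s. By Scheffé's lemma, this implies that for any $\theta \in \Theta$, the map $\theta' \mapsto \nu_{\theta'}(\cdot, Y_0)$ is continuous at $\theta$ $\bar{\mathbb{P}}_{\theta^*}$-a.s. with respect to the total variation norm $\|\cdot\|_1$. Similarly, as $\theta \mapsto q_\theta(x, x')$ is continuous by (NL5), the map $\theta \mapsto Q_\theta(x, dx')$ is continuous with respect to the total variation norm. Now note that we can write

\[ p^\lambda(Y_0^n; \theta) = \Bigl( \int g_\theta(x, Y_0)\, \lambda(dx) \Bigr) \int p^{x'}(Y_1^n; \theta)\, Q_\theta(x, dx')\, \nu_\theta(dx, Y_0). \]

From (NL4), it follows that $x \mapsto \sup_{\theta' \in \mathcal{U}_\theta} g_{\theta'}(x, Y_k)$ is bounded $\bar{\mathbb{P}}_{\theta^*}$-a.s. for every $k$. Therefore, $x \mapsto \sup_{\theta' \in \mathcal{U}_\theta} p^x(Y_1^n; \theta')$ is a bounded function $\bar{\mathbb{P}}_{\theta^*}$-a.s., and by dominated convergence the function $\theta' \mapsto p^x(Y_1^n; \theta')$ is continuous at $\theta$ $\bar{\mathbb{P}}_{\theta^*}$-a.s. for every $\theta \in \Theta$. Therefore, it follows that $\bar{\mathbb{P}}_{\theta^*}$-a.s.

\[ \begin{aligned} &\Bigl| \int p^{x'}(Y_1^n; \theta_m)\, Q_{\theta_m}(x, dx')\, \nu_{\theta_m}(dx, Y_0) - \int p^{x'}(Y_1^n; \theta)\, Q_\theta(x, dx')\, \nu_\theta(dx, Y_0) \Bigr| \\ &\qquad \le \int \bigl| p^{x'}(Y_1^n; \theta_m) - p^{x'}(Y_1^n; \theta) \bigr|\, Q_\theta(x, dx')\, \nu_\theta(dx, Y_0) \\ &\qquad\quad + \sup_{\theta' \in \mathcal{U}_\theta} |p^{\cdot}(Y_1^n; \theta')|_\infty\, \bigl\| \nu_{\theta_m}(\cdot, Y_0) Q_{\theta_m} - \nu_\theta(\cdot, Y_0) Q_\theta \bigr\|_1 \xrightarrow[m \to \infty]{} 0 \end{aligned} \]

for any sequence $(\theta_m)_{m \ge 0} \subset \mathcal{U}_\theta$, $\theta_m \to \theta$. Here we have used the dominated convergence theorem to conclude convergence of the first term, and the continuity in total variation established above for the second term. (A5) follows.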

It remains to establish assumption (A6). We established above that $Q_\theta$ is $V$-uniformly ergodic with $V(x) = 1 + |x|$. Moreover, $\theta \not\sim \theta^*$ implies $\bar{\mathbb{P}}^Y_\theta \ne \bar{\mathbb{P}}^Y_{\theta^*}$ by definition. Therefore, (A6') would be established if

\[ \bar{\mathbb{P}}_{\theta^*}\Bigl( \int |x'|\, Q_\theta(x, dx')\, g_\theta(x, Y_0)\, \lambda(dx) < \infty \Bigr) > 0. \]

But as $Q_\theta$ is $V$-uniformly ergodic, it follows that $Q_\theta V \le \alpha_\theta V + K_\theta$ for some positive constants $\alpha_\theta, K_\theta$ ([27], Theorem 16.0.1). Assumption (A6') therefore follows from (NL4), and (A6) follows from Theorem 2.

Having verified (A1)-(A6), we can apply Theorem 1. As $g_{\theta^*}(x, y) > 0$ for all $x, y$, and as $Q_{\theta^*}$ is $V$-uniformly ergodic (hence certainly aperiodic), we find that the MLE is consistent for any initial measure $\nu$.

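Remark 11. The assumption in (NL4) that

\[ \bar{\mathbb{E}}_{\theta^*}\Bigl[ \int (\log g_{\theta^*}(x, Y_0))^-\, \pi_{\theta^*}(dx) \Bigr] < \infty \]

is used to verify the second part of (A2). This condition can be replaced by the following assumption: there exists a set $D \in \mathcal{X}$ such that:

(i) $\bar{\mathbb{E}}_{\theta^*}\bigl[ \bigl( \log \int_D g_{\theta^*}(x, Y_0)\, \lambda(dx) \bigr)^- \bigr] < \infty$, and

(ii) $\inf_{x, x' \in D} q_{\theta^*}(x, x') > 0$.

The latter condition is sometimes easier to check.

To see that the result still holds under this modified condition, note that

\[ \bar{\mathbb{E}}_{\theta^*}\Bigl[ \Bigl( \log \int g_{\theta^*}(x, Y_0)\, \pi_{\theta^*}(dx) \Bigr)^+ \Bigr] < \infty \]

follows as above. On the other hand,

\[ \begin{aligned} \int g_{\theta^*}(x, Y_0)\, \pi_{\theta^*}(dx) &\ge \int_{D \times D} g_{\theta^*}(x', Y_0)\, q_{\theta^*}(x, x')\, \pi_{\theta^*}(dx)\, \lambda(dx') \\ &\ge \pi_{\theta^*}(D) \inf_{x, x' \in D} q_{\theta^*}(x, x') \int_D g_{\theta^*}(x, Y_0)\, \lambda(dx). \end{aligned} \]

It follows from (i) that $\lambda(D) > 0$, so that $\pi_{\theta^*}(D) > 0$ also (as $Q_{\theta^*}$, and therefore $\pi_{\theta^*}$, has a positive density with respect to $\lambda$). It now follows directly that also $\bar{\mathbb{E}}_{\theta^*}\bigl[ \bigl( \log \int g_{\theta^*}(x, Y_0)\, \pi_{\theta^*}(dx) \bigr)^- \bigr] < \infty$, and the claim is established.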

4. Proof of Theorem 1. The proof of Theorem 1 consists of three parts. First, we prove pointwise convergence of the log-likelihood under the true parameter $\theta^*$ (Section 4.1). Next, we establish identifiability of every $\theta \not\sim \theta^*$ (Section 4.2). Finally, we put everything together to complete the proof of consistency (Section 4.3).

4.1. Pointwise convergence of the normalized log-likelihood under $\theta^*$. The goal of this section is to show that our hidden Markov model possesses a finite entropy rate and that the asymptotic equipartition property holds. We begin with a simple result, which will be used to reduce to the stationary case.

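Lemma 7. Assume (A1). Then $(Y_k)_{k \ge 0}$ is ergodic under $\bar{\mathbb{P}}_{\theta^*}$. Moreover, if one of the assumptions 1-3 of Theorem 1 holds, then $\mathbb{P}^{\nu,Y}_{\theta^*} \sim \bar{\mathbb{P}}^Y_{\theta^*}$.

Proof. As $Q_{\theta^*}$ possesses a unique invariant measure by (A1), the kernel $T_{\theta^*}$ possesses a unique invariant measure also. This implies that the process $(X_k, Y_k)_{k \ge 0}$ is ergodic under the stationary measure $\bar{\mathbb{P}}_{\theta^*}$ (as the latter is trivially an extreme point of the set of stationary measures). Therefore, $(Y_k)_{k \ge 0}$ is ergodic also.

If $\nu \sim \pi_{\theta^*}$, it is easily seen that $\mathbb{P}^{\nu,Y}_{\theta^*} \sim \bar{\mathbb{P}}^Y_{\theta^*}$. Otherwise, we argue as follows. Suppose that $Q_{\theta^*}$ has period $d$ [this is guaranteed to hold for some $d \in \mathbb{N}$ by (A1)]. Then we can partition the signal state space as $\mathsf{X} = C_1 \cup \cdots \cup C_d \cup F$, where $C_1, \ldots, C_d$ are the periodic classes and $\pi_{\theta^*}(F) = 0$ ([27], Section 5.4.3). Note that $C_1, \ldots, C_d$ are absorbing sets for $Q^d_{\theta^*}$, where the restriction of $Q^d_{\theta^*}$ to $C_i$ is positive Harris and aperiodic with the corresponding invariant probability measure $\pi^i_{\theta^*}$. Moreover, the Harris recurrence assumption guarantees that $\mathbb{P}^x_{\theta^*}(X_n \notin F \text{ eventually}) = 1$ for all $x \in \mathsf{X}$. Therefore, $\nu Q^{nd}_{\theta^*}(F) \downarrow 0$ and $\nu Q^{nd}_{\theta^*}(C_i) \uparrow c^i_\nu$ as $n \to \infty$. It follows from the ergodic theorem for aperiodic Harris recurrent Markov chains that

\[ \bigl\| \nu Q^{nd}_{\theta^*} - \pi^\nu_{\theta^*} Q^{nd}_{\theta^*} \bigr\|_1 \xrightarrow[n \to \infty]{} 0, \qquad \pi^\nu_{\theta^*} \stackrel{\mathrm{def}}{=} \sum_{i=1}^{d} c^i_\nu\, \pi^i_{\theta^*}. \]

Using $g_{\theta^*}(x, y) > 0$ and [31], Lemma 3.7, this implies that $\mathbb{P}^{\nu,Y}_{\theta^*} \sim \mathbb{P}^{\pi^\nu_{\theta^*},Y}_{\theta^*}$. But if $\nu$ has mass in each periodic class $C_i$ or if $d = 1$, then $c^i_\nu > 0$ for all $i = 1, \ldots, d$. Thus, $\pi^\nu_{\theta^*} \sim \pi_{\theta^*} = d^{-1} \sum_{i=1}^{d} \pi^i_{\theta^*}$, which implies $\mathbb{P}^{\nu,Y}_{\theta^*} \sim \mathbb{P}^{\pi^\nu_{\theta^*},Y}_{\theta^*} \sim \bar{\mathbb{P}}^Y_{\theta^*}$. □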

We will also need the following lemma. 

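Lemma 8. Assume (A2). Then $\bar{\mathbb{E}}_{\theta^*}[|\log \bar{p}(Y_0^n; \theta^*)|] < \infty$ for all $n \ge 0$.

Proof. We easily obtain the upper bound

\[ \bar{\mathbb{E}}_{\theta^*}[(\log \bar{p}(Y_0^n; \theta^*))^+] \le \bar{\mathbb{E}}_{\theta^*}\Bigl[ \sum_{k=0}^{n} \sup_{x \in \mathsf{X}} (\log g_{\theta^*}(x, Y_k))^+ \Bigr] < \infty. \]

On the other hand, we can estimate

\[ \begin{aligned} \bar{\mathbb{E}}_{\theta^*}[\log \bar{p}(Y_0^n; \theta^*)] &= \bar{\mathbb{E}}_{\theta^*}\Bigl[ \log \frac{\bar{p}(Y_0^n; \theta^*)}{\prod_{k=0}^{n} \int g_{\theta^*}(x, Y_k)\, \pi_{\theta^*}(dx)} \Bigr] + \bar{\mathbb{E}}_{\theta^*}\Bigl[ \sum_{k=0}^{n} \log \int g_{\theta^*}(x, Y_k)\, \pi_{\theta^*}(dx) \Bigr] \\ &\ge -(n+1)\, \bar{\mathbb{E}}_{\theta^*}\Bigl[ \Bigl| \log \int g_{\theta^*}(x, Y_0)\, \pi_{\theta^*}(dx) \Bigr| \Bigr] > -\infty, \end{aligned} \]

where we have used that relative entropy is nonnegative. □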



The main result of this section follows. After a reduction to the stationary 
case by means of the previous lemma, the proof concludes by verifying the 
assumptions of the generalized Shannon-Breiman-McMillan theorem [2]. 

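Theorem 9. Assume (A1) and (A2). There exists $-\infty < \ell(\theta^*) < \infty$ such that

\[ \ell(\theta^*) = \lim_{n \to \infty} \bar{\mathbb{E}}_{\theta^*}[n^{-1} \log \bar{p}(Y_0^n; \theta^*)], \tag{7} \]

and such that

\[ \ell(\theta^*) = \lim_{n \to \infty} n^{-1} \log p^\nu(Y_0^n; \theta^*), \qquad \bar{\mathbb{P}}_{\theta^*}\text{-a.s.} \tag{8} \]

for any probability measure $\nu$ such that one of the assumptions 1-3 of Theorem 1 is satisfied (in particular, the result holds for $\nu = \pi_{\theta^*}$).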



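Proof. Note that $D_n \stackrel{\mathrm{def}}{=} \bar{\mathbb{E}}_{\theta^*}[\log \bar{p}(Y_0^{n+1}; \theta^*)] - \bar{\mathbb{E}}_{\theta^*}[\log \bar{p}(Y_0^n; \theta^*)]$ is a nondecreasing sequence by [2], page 1292, and Lemma 8. Therefore (7) follows immediately. As $Y_0^\infty$ is stationary and ergodic under $\bar{\mathbb{P}}_{\theta^*}$ by (A1), we can estimate

\[ -\infty < D_0 \le \sup_{n \ge 0} D_n = \lim_{n \to \infty} \bar{\mathbb{E}}_{\theta^*}[n^{-1} \log \bar{p}(Y_0^n; \theta^*)] \le \bar{\mathbb{E}}_{\theta^*}\Bigl[ \sup_{x \in \mathsf{X}} (\log g_{\theta^*}(x, Y_0))^+ \Bigr] < \infty, \]

where we have used (A2). To proceed, we note that the generalized Shannon-Breiman-McMillan theorem ([2], Theorem 1) implies that (8) holds for $\nu = \pi_{\theta^*}$. Therefore, to prove (8) for arbitrary $\nu$, it suffices to prove the existence of a random variable $C_\nu$ satisfying $\bar{\mathbb{P}}_{\theta^*}(0 < C_\nu < \infty) = 1$ and

\[ \lim_{n \to \infty} \frac{p^\nu(Y_0^n; \theta^*)}{\bar{p}(Y_0^n; \theta^*)} = C_\nu, \qquad \bar{\mathbb{P}}_{\theta^*}\text{-a.s.} \tag{9} \]

Let $\mathbb{P}^\nu_n \stackrel{\mathrm{def}}{=} \mathbb{P}^\nu_{\theta^*}(Y_0^n \in \cdot)$ and $\bar{\mathbb{P}}_n \stackrel{\mathrm{def}}{=} \bar{\mathbb{P}}_{\theta^*}(Y_0^n \in \cdot)$. Then $p^\nu(Y_0^n; \theta^*) / \bar{p}(Y_0^n; \theta^*) = d\mathbb{P}^\nu_n / d\bar{\mathbb{P}}_n$, and we find that (9) holds with $0 < C_\nu = d\mathbb{P}^{\nu,Y}_{\theta^*} / d\bar{\mathbb{P}}^Y_{\theta^*} < \infty$ provided $\mathbb{P}^{\nu,Y}_{\theta^*} \sim \bar{\mathbb{P}}^Y_{\theta^*}$. But the latter was already established in Lemma 7. □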

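Remark 12. In the case that $g_{\theta^*}(x, y) > 0$ but $Q_{\theta^*}$ is periodic, the assumption in the above theorem that the initial probability measure $\nu$ has mass in each periodic class of $Q_{\theta^*}$ cannot be eliminated, as the following example shows. Let $\mathsf{X} = \mathsf{Y} = \{1, 2\}$, and let $Q_\theta$ be the Markov chain with transition probability matrix $\mathbf{Q}$ and invariant measure $\pi$ (independent of $\theta$)

\[ \mathbf{Q} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad \pi = \begin{pmatrix} 1/2 \\ 1/2 \end{pmatrix}. \]

Then $Q_\theta$ is positive (Harris) recurrent with period 2. For each $\theta \in \Theta = [0.5, 0.9]$, define the observation density $g_\theta(x, y)$ (with respect to the counting measure)

\[ g_\theta(x, y) = \theta\, \mathbf{1}_{y = x} + (1 - \theta)\, \mathbf{1}_{y \ne x}, \]

and let $\theta^* = 0.7$, for example. Then certainly assumptions (A1) and (A2) are satisfied.

Now consider $\nu = \delta_1$. Then $\nu$ only has mass in one of the two periodic classes of $Q_{\theta^*}$. We can compute the observation likelihood as follows:

\[ \log p^\nu(Y_0^{2n}; \theta) = \sum_{k=0}^{n} \{ \mathbf{1}_{Y_{2k} = 1} \log \theta + \mathbf{1}_{Y_{2k} = 2} \log(1 - \theta) \} + \sum_{k=1}^{n} \{ \mathbf{1}_{Y_{2k-1} = 2} \log \theta + \mathbf{1}_{Y_{2k-1} = 1} \log(1 - \theta) \}. \]

A straightforward computation shows that

\[ \lim_{n \to \infty} (2n)^{-1} \log p^\nu(Y_0^{2n}; \theta) = \{ \theta^* \log \theta + (1 - \theta^*) \log(1 - \theta) \}\, \mathbf{1}_{X_0 = 1} + \{ (1 - \theta^*) \log \theta + \theta^* \log(1 - \theta) \}\, \mathbf{1}_{X_0 = 2}, \qquad \bar{\mathbb{P}}_{\theta^*}\text{-a.s.} \]

Therefore, $\lim_{n \to \infty} n^{-1} \log p^\nu(Y_0^n; \theta^*)$ is not even nonrandom $\bar{\mathbb{P}}_{\theta^*}$-a.s., let alone equal to $\ell(\theta^*)$. Thus we see that Theorem 9 does not hold for such $\nu$. Moreover, we can compute directly in this example that

\[ \lim_{n \to \infty} \hat{\theta}_{\nu,n} = \theta^*\, \mathbf{1}_{X_0 = 1} + 0.5\, \mathbf{1}_{X_0 = 2}, \qquad \bar{\mathbb{P}}_{\theta^*}\text{-a.s.}, \]

so that evidently the maximum likelihood estimator is not consistent when we choose the initial measure $\nu$. This shows that also in Theorem 1 the assumption that $\nu$ has mass in each periodic class of $Q_{\theta^*}$ cannot be eliminated.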
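The behaviour described in Remark 12 is easy to reproduce by simulation. The sketch below assumes NumPy; the function name, the sample size and the seed are hypothetical. It uses the fact that, with $\nu = \delta_1$, the log-likelihood is $n\{m \log\theta + (1-m)\log(1-\theta)\}$, where $m$ is the fraction of observations that agree with the signal path the misspecified initial law assumes, so the maximizer over $\Theta = [0.5, 0.9]$ is $m$ clipped to that interval.

```python
import numpy as np

def simulate_and_fit(x0, n, theta_star=0.7, seed=0):
    """Simulate the period-2 example started at X_0 = x0 and return the MLE
    computed under the (possibly wrong) initial law nu = delta_1."""
    rng = np.random.default_rng(seed)
    k = np.arange(n)
    # The signal alternates deterministically: X_k = x0 for even k, 3 - x0 for odd k.
    x = np.where(k % 2 == 0, x0, 3 - x0)
    # Each observation equals the signal with probability theta_star.
    correct = rng.random(n) < theta_star
    y = np.where(correct, x, 3 - x)
    # Under nu = delta_1 the likelihood assumes X_k = 1 for even k and 2 for odd k.
    x_assumed = np.where(k % 2 == 0, 1, 2)
    match = np.mean(y == x_assumed)
    # The log-likelihood is concave in theta with unconstrained maximizer `match`,
    # so the maximizer over [0.5, 0.9] is the clipped value.
    return float(np.clip(match, 0.5, 0.9))

est_when_x0_is_1 = simulate_and_fit(x0=1, n=20000)
est_when_x0_is_2 = simulate_and_fit(x0=2, n=20000)
```

When the true chain starts in state 1 the estimate settles near $\theta^*$; when it starts in state 2 the match fraction is near $1 - \theta^*$ and the estimate sticks to the boundary value $0.5$, in line with the limit stated in the remark.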
4.2. Identifiability. In this section, we establish the identifiability of the parameter $\theta$. The key issue in the proof consists in showing that the relative entropy rate between $\bar{p}(\cdot; \theta^*)$ and $p^\lambda(\cdot; \theta)$ may be zero only if $\theta^* \sim \theta$. Our proof is based on a very simple and intuitive information-theoretic device, given as Lemma 10 below, which avoids the need for an explicit representation of the asymptotic contrast function as in previous proofs of identifiability.

Definition 2. For each $n$, let $P_n$ and $Q_n$ be probability measures on a measurable space $(\mathsf{Z}_n, \mathcal{Z}_n)$. Then $(Q_n)$ is exponentially separated from $(P_n)$, denoted as $(Q_n) \dashv (P_n)$, if there exists a sequence $(A_n)$ of sets $A_n \in \mathcal{Z}_n$ such that

\[ \liminf_{n \to \infty} P_n(A_n) > 0, \qquad \limsup_{n \to \infty} n^{-1} \log Q_n(A_n) < 0. \]

If $P$ and $Q$ are probability measures on $(\mathsf{Y}^{\mathbb{N}}, \mathcal{Y}^{\otimes \mathbb{N}})$, then we will write $Q \dashv P$ if $(Q_n) \dashv (P_n)$ with $Q_n = Q(Y_0^n \in \cdot)$ and $P_n = P(Y_0^n \in \cdot)$.

Lemma 10. If $(Q_n) \dashv (P_n)$, then $\liminf_{n \to \infty} n^{-1}\, \mathrm{KL}(P_n \| Q_n) > 0$.

Proof. A standard property of the relative entropy ([10], Lemma 1.4.3(g)) states that for any pair of probability measures $P, Q$ and measurable set $A$

\[ \mathrm{KL}(P \| Q) \ge P(A) \log P(A) - P(A) \log Q(A) - 1, \]

where $0 \log 0 = 0$ by convention. As $x \log x \ge -e^{-1}$, we have

\[ \liminf_{n \to \infty} n^{-1}\, \mathrm{KL}(P_n \| Q_n) \ge \Bigl( \liminf_{n \to \infty} P_n(A_n) \Bigr) \Bigl( -\limsup_{n \to \infty} n^{-1} \log Q_n(A_n) \Bigr) > 0. \]

The result follows directly. □
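The inequality used in the proof, and the way exponential separation forces linear growth of the relative entropy, can be illustrated on the simplest possible example. The sketch below uses only the Python standard library and takes $P_n$ and $Q_n$ to be the laws of $n$ i.i.d. Bernoulli variables with success probabilities $a$ and $b$; this i.i.d. setting, the threshold set $A_n$ and all names are assumptions made for illustration and are not part of the paper's argument.

```python
import math

def binom_tail(n, p, k0):
    """P(Binomial(n, p) >= k0), computed exactly."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k0, n + 1))

def separation_check(n, a=0.7, b=0.4):
    """Compare KL(P_n || Q_n) with the lower bound from the proof of Lemma 10.

    P_n, Q_n: laws of n i.i.d. Bernoulli(a), Bernoulli(b) variables.
    A_n: the event that the number of successes is at least n(a + b)/2.
    """
    k0 = math.ceil(n * (a + b) / 2)
    PA = binom_tail(n, a, k0)
    QA = binom_tail(n, b, k0)
    kl = n * (a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b)))
    bound = PA * math.log(PA) - PA * math.log(QA) - 1
    return PA, QA, kl, bound

results = {n: separation_check(n) for n in (50, 100, 200)}
```

In this example $P_n(A_n)$ stays bounded away from zero while $Q_n(A_n)$ decays exponentially, so the lower bound grows linearly in $n$, as in the lemma.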



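As a consequence of this result, we obtain positive entropy rates:

\[ \mathbb{P}^{\nu,Y}_\theta \dashv \bar{\mathbb{P}}^Y_{\theta^*} \quad \Longrightarrow \quad \liminf_{n \to \infty} \bar{\mathbb{E}}_{\theta^*}\Bigl[ n^{-1} \log \frac{\bar{p}(Y_0^n; \theta^*)}{p^\nu(Y_0^n; \theta)} \Bigr] > 0. \tag{10} \]

This yields identifiability of the asymptotic contrast function in a very simple and natural manner. It turns out that the exponential separation assumption $\mathbb{P}^{\nu,Y}_\theta \dashv \bar{\mathbb{P}}^Y_{\theta^*}$ always holds when the Markov chain $Q_\theta$ is $V$-uniformly ergodic and $\nu(V) < \infty$; this is proved in Section 5 below. This observation allows us to establish the consistency of the MLE in a large class of models.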

There is an additional complication that arises in our proof of consistency. Rather than (10), the following result turns out to be of crucial importance:

\[ \liminf_{n \to \infty} \bar{\mathbb{E}}_{\theta^*}\Bigl[ n^{-1} \log \frac{\bar{p}(Y_0^n; \theta^*)}{p^\lambda(Y_0^n; \theta)} \Bigr] > 0. \]

This result seems almost identical to (10). However, note that the probability measure $\nu$ is replaced here by $\lambda$, the dominating measure on $(\mathsf{X}, \mathcal{X})$, which may only be $\sigma$-finite [$\lambda(\mathsf{X}) = \infty$]. In this case, a direct application of Lemma 10 is not possible since $y_0^n \mapsto p^\lambda(y_0^n; \theta)$ is not a probability density:

\[ \int p^\lambda(y_0^n; \theta)\, dy_0^n = \lambda(\mathsf{X}) = \infty. \]

Nevertheless, the following lemma allows us to reduce the proof in the case of an improper initial measure $\lambda$ to an application of Lemma 10.

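Lemma 11. Assume (A4). For $\theta \not\sim \theta^*$ such that $p^\lambda(Y_0^{r_\theta}; \theta) > 0$ $\bar{\mathbb{P}}_{\theta^*}$-a.s., there exists a probability measure $\tilde{\mathbb{P}}^\lambda_\theta$ on $(\mathsf{Y}^{\mathbb{N}}, \mathcal{Y}^{\otimes \mathbb{N}})$ such that

\[ \tilde{\mathbb{P}}^\lambda_\theta(Y_0^n \in A) = \int \mathbf{1}_A(y_0^n)\, \frac{p^\lambda(y_0^n; \theta)}{p^\lambda(y_0^{r_\theta}; \theta)}\, \bar{p}(y_0^{r_\theta}; \theta^*)\, dy_0^n \]

for all $n \ge r_\theta$ and $A \in \mathcal{Y}^{\otimes(n+1)}$.

Proof. As $\int p^\lambda(y_0^n; \theta)\, dy_{r_\theta+1}^n = p^\lambda(y_0^{r_\theta}; \theta) < \infty$ $\bar{\mathbb{P}}_{\theta^*}$-a.e. for all $n \ge r_\theta$ by Fubini's theorem and assumption (A4), we can define for $n \ge r_\theta$

\[ \tilde{p}^\lambda(y_0^n; \theta) \stackrel{\mathrm{def}}{=} \frac{p^\lambda(y_0^n; \theta)}{p^\lambda(y_0^{r_\theta}; \theta)}\, \bar{p}(y_0^{r_\theta}; \theta^*) < \infty, \qquad dy_0^n\text{-a.e.} \tag{11} \]

Note that, by construction, $\{\tilde{p}^\lambda(y_0^n; \theta)\, dy_0^n : n \ge r_\theta\}$ is a consistent family of probability measures. By the extension theorem, we may construct a probability measure $\tilde{\mathbb{P}}^\lambda_\theta$ on $(\mathsf{Y}^{\mathbb{N}}, \mathcal{Y}^{\otimes \mathbb{N}})$ such that $\tilde{\mathbb{P}}^\lambda_\theta(Y_0^n \in A) = \int \mathbf{1}_A(y_0^n)\, \tilde{p}^\lambda(y_0^n; \theta)\, dy_0^n$. □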



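Theorem 12. Assume (A2), (A4) and (A6). Then for every $\theta \not\sim \theta^*$

\[ \liminf_{n \to \infty} \bar{\mathbb{E}}_{\theta^*}\Bigl[ n^{-1} \log \frac{\bar{p}(Y_0^n; \theta^*)}{p^\lambda(Y_0^n; \theta)} \Bigr] > 0. \tag{12} \]

Proof. Fix $\theta \not\sim \theta^*$. Let us assume first that $\bar{\mathbb{P}}_{\theta^*}(p^\lambda(Y_0^{r_\theta}; \theta) = 0) > 0$. As we have $\int p^\lambda(y_0^n; \theta)\, dy_{r_\theta+1}^n = p^\lambda(y_0^{r_\theta}; \theta)$ by Fubini's theorem, it must be the case that $\bar{\mathbb{P}}_{\theta^*}(p^\lambda(Y_0^n; \theta) = 0) > 0$ for all $n \ge r_\theta$, so that the expression in (12) is clearly equal to $+\infty$. Therefore, in this case, the claim holds trivially.

We may therefore assume that $p^\lambda(Y_0^{r_\theta}; \theta) > 0$ $\bar{\mathbb{P}}_{\theta^*}$-a.s. Let $\tilde{p}^\lambda(y_0^n; \theta)$ be as in the proof of Lemma 11. Note that $\bar{\mathbb{E}}_{\theta^*}[|\log \bar{p}(Y_0^{r_\theta}; \theta^*)|] < \infty$ by Lemma 8, while $\bar{\mathbb{E}}_{\theta^*}[(\log p^\lambda(Y_0^{r_\theta}; \theta))^+] < \infty$ by assumption (A4). Then we find

\[ \liminf_{n \to \infty} \bar{\mathbb{E}}_{\theta^*}\Bigl[ n^{-1} \log \frac{\bar{p}(Y_0^n; \theta^*)}{\tilde{p}^\lambda(Y_0^n; \theta)} \Bigr] \le \liminf_{n \to \infty} \bar{\mathbb{E}}_{\theta^*}\Bigl[ n^{-1} \log \frac{\bar{p}(Y_0^n; \theta^*)}{p^\lambda(Y_0^n; \theta)} \Bigr]. \]

Assumption (A6) gives $\tilde{\mathbb{P}}^\lambda_\theta \dashv \bar{\mathbb{P}}^Y_{\theta^*}$. Therefore, (12) follows from Lemma 10. □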






4.3. Consistency of the MLE. Proofs of convergence of the MLE typically require establishing the convergence of the normalized likelihood $n^{-1}\log p^{\nu}(Y_0^n;\theta)$ $\bar{\mathbb P}_{\theta^*}$-a.s. for any parameter $\theta$. The existence of a limit follows from the Shannon-Breiman-McMillan theorem when $\theta=\theta^*$ (as in Theorem 9), but is far from clear for other $\theta$. In [23], the convergence of $n^{-1}\log p^{\nu}(Y_0^n;\theta)$ is established using Kingman's subadditive ergodic theorem. This approach fails in the present setting, as $\log p^{\nu}(Y_0^n;\theta)$ may not be subadditive even up to a constant.

The approach adopted here is inspired by [23]. We note, however, that it is not necessary to prove convergence of $n^{-1}\log p^{\nu}(Y_0^n;\theta)$ as long as it is asymptotically bounded away from $\ell(\theta^*)$, the likelihood of the true parameter. It therefore suffices to bound $n^{-1}\log p^{\nu}(Y_0^n;\theta)$ above by an auxiliary sequence that is bounded away from $\ell(\theta^*)$. Here the asymptotics of $n^{-1}\log p^{\lambda}(Y_0^n;\theta)$ come into play.
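
To make the normalized log-likelihood concrete, the following is a minimal sketch for the special case of a finite-state HMM with finitely many observation symbols. This restriction, the example model, and the names `normalized_loglik` and `brute_force_loglik` are assumptions made for illustration and do not come from the paper. The normalized forward recursion computes the log-likelihood in time linear in the number of observations; the brute-force sum over all hidden paths serves only as a check on short sequences. For simplicity the sketch divides by the number of observations rather than by $n$, which does not affect the limit.

```python
import math
from itertools import product


def normalized_loglik(y, nu, Q, G):
    """Log-likelihood of y under a finite-state HMM, divided by len(y).

    nu[x]   : initial probability of hidden state x
    Q[x][z] : transition probability from state x to state z
    G[x][o] : probability of emitting symbol o in state x
    """
    d = len(nu)
    # Unnormalized filter at time 0: nu(x) * g(x, y_0).
    alpha = [nu[x] * G[x][y[0]] for x in range(d)]
    loglik = 0.0
    for k in range(len(y)):
        if k > 0:
            # Propagate through Q, then weight by the emission probability.
            alpha = [
                sum(alpha[xp] * Q[xp][x] for xp in range(d)) * G[x][y[k]]
                for x in range(d)
            ]
        c = sum(alpha)            # predictive probability of y_k given the past
        loglik += math.log(c)     # the log-likelihood is the sum of these logs
        alpha = [a / c for a in alpha]  # renormalize to avoid underflow
    return loglik / len(y)


def brute_force_loglik(y, nu, Q, G):
    """Same quantity, by summing over every hidden path (short y only)."""
    d = len(nu)
    total = 0.0
    for path in product(range(d), repeat=len(y)):
        p = nu[path[0]] * G[path[0]][y[0]]
        for k in range(1, len(y)):
            p *= Q[path[k - 1]][path[k]] * G[path[k]][y[k]]
        total += p
    return math.log(total) / len(y)


# Hypothetical two-state model with binary observations.
nu = [0.5, 0.5]
Q = [[0.9, 0.1], [0.2, 0.8]]
G = [[0.8, 0.2], [0.3, 0.7]]
y = [0, 0, 1, 1, 0, 1]
print(normalized_loglik(y, nu, Q, G), brute_force_loglik(y, nu, Q, G))
```

The two printed values should agree up to rounding error, since both evaluate the same sum over hidden paths.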

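Lemma 13. Assume (A1)-(A6). Then, for any $\theta\not\sim\theta^*$, there exist an integer $n_\theta$ and $\eta_\theta>0$ such that $B(\theta,\eta_\theta)\subseteq\mathcal U_\theta$ and

$$\begin{aligned}
&\frac{1}{n_\theta+l}\bar{\mathbb E}_{\theta^*}\Big[\sup_{\theta'\in B(\theta,\eta_\theta)}\log p^{\lambda}(Y_0^{n_\theta};\theta')\Big]+\frac{1}{n_\theta+l}\sup_{\theta'\in B(\theta,\eta_\theta)}\log|q_{\theta'}|_\infty\\
&\qquad+\frac{l-1}{n_\theta+l}\bar{\mathbb E}_{\theta^*}\Big[\sup_{\theta'\in B(\theta,\eta_\theta)}\sup_{x\in\mathsf X}(\log g_{\theta'}(x,Y_0))^+\Big]<\ell(\theta^*).
\end{aligned}$$

Here $B(\theta,\eta)\subseteq\Theta$ is the ball of radius $\eta>0$ centered at $\theta\in\Theta$.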

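Proof. By (7) and Theorem 12, $\limsup_{n\to\infty}n^{-1}\bar{\mathbb E}_{\theta^*}[\log p^{\lambda}(Y_0^n;\theta)]<\ell(\theta^*)$. Using (A4), this implies that there exists a (nonrandom) integer $n_\theta>r_\theta$ such that

$$\begin{aligned}
&\frac{1}{n_\theta+l}\bar{\mathbb E}_{\theta^*}\big[\log p^{\lambda}(Y_0^{n_\theta};\theta)\big]+\frac{1}{n_\theta+l}\sup_{\theta'\in\mathcal U_\theta}\log|q_{\theta'}|_\infty\\
&\qquad+\frac{l-1}{n_\theta+l}\bar{\mathbb E}_{\theta^*}\Big[\sup_{\theta'\in\mathcal U_\theta}\sup_{x\in\mathsf X}(\log g_{\theta'}(x,Y_0))^+\Big]<\ell(\theta^*).
\end{aligned}\tag{13}$$

For any $\eta>0$ such that $B(\theta,\eta)\subseteq\mathcal U_\theta$, we have

$$\sup_{\theta'\in B(\theta,\eta)}\log p^{\lambda}(Y_0^{n_\theta};\theta')\le\sup_{\theta'\in\mathcal U_\theta}(\log p^{\lambda}(Y_0^{r_\theta};\theta'))^+ +\sum_{k=r_\theta+1}^{n_\theta}\sup_{\theta'\in\mathcal U_\theta}\sup_{x\in\mathsf X}(\log g_{\theta'}(x,Y_k))^+,$$

where the right-hand side does not depend on $\eta$ and is integrable. But then

$$\limsup_{\eta\downarrow0}\bar{\mathbb E}_{\theta^*}\Big[\sup_{\theta'\in B(\theta,\eta)}\log p^{\lambda}(Y_0^{n_\theta};\theta')\Big]\le\bar{\mathbb E}_{\theta^*}\Big[\limsup_{\eta\downarrow0}\sup_{\theta'\in B(\theta,\eta)}\log p^{\lambda}(Y_0^{n_\theta};\theta')\Big]\le\bar{\mathbb E}_{\theta^*}\big[\log p^{\lambda}(Y_0^{n_\theta};\theta)\big], \tag{14}$$

by (A5) and Fatou's lemma. Together (14) and (13) complete the proof. □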

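Proof of Theorem 1. Since, by Theorem 9, $\lim_{n\to\infty}n^{-1}\ell_{\nu,n}(\theta^*)=\ell(\theta^*)$, $\bar{\mathbb P}_{\theta^*}$-a.s., it is sufficient to prove that for any closed set $C\subset\Theta$ such that $C\cap[\theta^*]=\varnothing$

$$\limsup_{n\to\infty}\sup_{\theta'\in C}n^{-1}\ell_{\nu,n}(\theta')<\ell(\theta^*),\qquad\bar{\mathbb P}_{\theta^*}\text{-a.s.}$$

Now note that $\{B(\theta,\eta_\theta):\theta\in C\}$ is a cover of $C$, where $\eta_\theta$ are defined in Lemma 13. As $\Theta$ is compact, $C$ is also compact and thus admits a finite subcover $\{B(\theta_i,\eta_{\theta_i}):\theta_i\in C,\ i=1,\ldots,N\}$. It therefore suffices to show that

$$\limsup_{n\to\infty}\sup_{\theta'\in B(\theta,\eta_\theta)\cap C}n^{-1}\ell_{\nu,n}(\theta')<\ell(\theta^*),\qquad\bar{\mathbb P}_{\theta^*}\text{-a.s.}$$

for any $\theta\not\sim\theta^*$. Fix $\theta\not\sim\theta^*$ and let $\eta_\theta$ and $n_\theta$ be as in Lemma 13. Note that

$$p^{\nu}(y_0^n;\theta')\le p^{\nu}(y_0^m;\theta')\,p^{\lambda}(y_{m+l}^n;\theta')\,\tilde g_{\theta'}(y_{m+1}^{m+l-1})\,|q_{\theta'}|_\infty, \tag{15}$$

$$p^{\lambda}(y_j^n;\theta')\le p^{\lambda}(y_j^m;\theta')\,p^{\lambda}(y_{m+l}^n;\theta')\,\tilde g_{\theta'}(y_{m+1}^{m+l-1})\,|q_{\theta'}|_\infty, \tag{16}$$

for any $j\le m$, $m+l\le n$ and $\theta'\not\sim\theta^*$, where $\tilde g_{\theta}(y_i^j)=\prod_{k=i}^{j}\sup_{x\in\mathsf X}g_{\theta}(x,y_k)$. We can therefore estimate, for all $n$ sufficiently large,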

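$$\begin{aligned}
\ell_{\nu,n}(\theta')
&\le\frac{1}{n_\theta+l}\sum_{r=1}^{n_\theta+l}\Big\{\ell_{\nu,r-1}(\theta')+\log p^{\lambda}(Y_{l+r-1}^{n};\theta')+\log\big(\tilde g_{\theta'}(Y_r^{l+r-2})\,|q_{\theta'}|_\infty\big)\Big\}\\
&\le\frac{1}{n_\theta+l}\sum_{r=1}^{n_\theta+l}\sum_{k=1}^{i(n)-1}\log p^{\lambda}\big(Y_{(n_\theta+l)(k-1)+l+r-1}^{(n_\theta+l)k+r-1};\theta'\big)\\
&\quad+\frac{1}{n_\theta+l}\sum_{r=1}^{n_\theta+l}\sum_{k=1}^{i(n)-1}\log\tilde g_{\theta'}\big(Y_{(n_\theta+l)(k-1)+r}^{(n_\theta+l)(k-1)+l+r-2}\big)+(i(n)-1)\log|q_{\theta'}|_\infty\\
&\quad+\frac{1}{n_\theta+l}\sum_{r=1}^{n_\theta+l}\log\big(\tilde g_{\theta'}(Y_0^{r-1})\,\tilde g_{\theta'}(Y_{(n_\theta+l)(i(n)-1)+r}^{n})\big)\\
&=\frac{1}{n_\theta+l}\sum_{k=1}^{(n_\theta+l)(i(n)-1)}\Big\{\log p^{\lambda}(Y_{k+l-1}^{k+l+n_\theta-1};\theta')+\sum_{t=0}^{l-2}\sup_{x\in\mathsf X}\log g_{\theta'}(x,Y_{k+t})\Big\}\\
&\quad+(i(n)-1)\log|q_{\theta'}|_\infty+\frac{1}{n_\theta+l}\sum_{r=1}^{n_\theta+l}\log\tilde g_{\theta'}(Y_0^{r-1})\\
&\quad+\frac{1}{n_\theta+l}\sum_{r=1}^{n_\theta+l}\ \sum_{k=(n_\theta+l)(i(n)-1)+r}^{n}\sup_{x\in\mathsf X}\log g_{\theta'}(x,Y_k),
\end{aligned}$$

where $i(n)\stackrel{\mathrm{def}}{=}\max\{m\in\mathbb N:m(n_\theta+l)\le n\}$. Here we have applied (15) with $m=r-1$ in the first inequality, while we have repeatedly applied (16) for every $m=(n_\theta+l)k+r-1$, $1\le k\le i(n)-2$, in the second inequality, together with the simple estimates $\ell_{\nu,r-1}(\theta')\le\log\tilde g_{\theta'}(Y_0^{r-1})$ and

$$p^{\lambda}\big(Y^{n}_{(n_\theta+l)(i(n)-2)+l+r-1};\theta'\big)\le p^{\lambda}\big(Y^{(n_\theta+l)(i(n)-1)+r-1}_{(n_\theta+l)(i(n)-2)+l+r-1};\theta'\big)\,\tilde g_{\theta'}\big(Y^{n}_{(n_\theta+l)(i(n)-1)+r}\big).$$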

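We can now estimate, for all $n$ sufficiently large,

$$\begin{aligned}
\sup_{\theta'\in B(\theta,\eta_\theta)\cap C}\ell_{\nu,n}(\theta')
&\le\frac{1}{n_\theta+l}\sum_{k=1}^{(n_\theta+l)(i(n)-1)}\sup_{\theta'\in B(\theta,\eta_\theta)\cap C}\log p^{\lambda}(Y_{k+l-1}^{k+l+n_\theta-1};\theta')\\
&\quad+\sum_{t=0}^{l-2}\frac{1}{n_\theta+l}\sum_{k=1}^{(n_\theta+l)(i(n)-1)}\sup_{\theta'\in B(\theta,\eta_\theta)\cap C}\sup_{x\in\mathsf X}(\log g_{\theta'}(x,Y_{k+t}))^+\\
&\quad+(i(n)-1)\sup_{\theta'\in B(\theta,\eta_\theta)\cap C}\log|q_{\theta'}|_\infty\\
&\quad+\frac{1}{n_\theta+l}\sum_{r=1}^{n_\theta+l}\sup_{\theta'\in B(\theta,\eta_\theta)\cap C}\log\tilde g_{\theta'}(Y_0^{r-1})\\
&\quad+\sum_{k=n-2(n_\theta+l)+1}^{n}\sup_{\theta'\in B(\theta,\eta_\theta)\cap C}\sup_{x\in\mathsf X}(\log g_{\theta'}(x,Y_k))^+,
\end{aligned}$$

where we have used that $(n_\theta+l)(i(n)-1)+r\ge n-2(n_\theta+l)+1$ to estimate the last term. But as $i(n)/n\to(n_\theta+l)^{-1}$ as $n\to\infty$, we find that

$$\begin{aligned}
\limsup_{n\to\infty}\sup_{\theta'\in B(\theta,\eta_\theta)\cap C}n^{-1}\ell_{\nu,n}(\theta')
&\le\frac{1}{n_\theta+l}\bar{\mathbb E}_{\theta^*}\Big[\sup_{\theta'\in B(\theta,\eta_\theta)}\log p^{\lambda}(Y_0^{n_\theta};\theta')\Big]\\
&\quad+\frac{l-1}{n_\theta+l}\bar{\mathbb E}_{\theta^*}\Big[\sup_{\theta'\in B(\theta,\eta_\theta)}\sup_{x\in\mathsf X}(\log g_{\theta'}(x,Y_0))^+\Big]+\frac{1}{n_\theta+l}\sup_{\theta'\in B(\theta,\eta_\theta)}\log|q_{\theta'}|_\infty\\
&<\ell(\theta^*)
\end{aligned}$$

by (A4), Birkhoff's ergodic theorem, Lemma 13, and the elementary fact that $\lim_{n\to\infty}\frac1n\sum_{k=n-r+1}^{n}\xi_k=\lim_{n\to\infty}\frac1n\sum_{k=1}^{n}\xi_k-\lim_{n\to\infty}\frac1n\sum_{k=1}^{n-r}\xi_k=0$ for any stationary ergodic sequence $(\xi_k)_{k\ge0}$ with $\mathbb E(|\xi_1|)<\infty$. This completes the proof. □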

5. Exponential separation and V-uniform ergodicity. As is explained in Remark 6, the key step in establishing assumption (A6) is to obtain a type of large deviations property. The following Azuma-Hoeffding type inequality provides what is needed in the V-uniformly ergodic case.

Theorem 14. Assume that $Q_\theta$ is $V_\theta$-uniformly ergodic. Fix $s\ge0$, and let $f:\mathsf Y^{s+1}\to\mathbb R$ be such that $|f|_\infty<\infty$. Then there exists a constant $K$ such that

$$\mathbb P^{\nu}_{\theta}\left[\Big|\sum_{i=1}^{n}\big\{f(Y_i^{i+s})-\bar{\mathbb E}_{\theta}[f(Y_0^s)]\big\}\Big|\ge t\right]\le K\nu(V_\theta)\exp\left[-\frac1K\Big(\frac{t^2}{n}\wedge t\Big)\right]$$

for any probability measure $\nu$ and any $t>0$.

We will first use this result in Section 5.1 to prove Theorem 2. In Section 5.2, we will establish a general Azuma-Hoeffding type large deviations inequality for V-uniformly ergodic Markov chains, which forms the basis for the proof of Theorem 14. Finally, Section 5.3 completes the proof of Theorem 14.



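5.1. Proof of Theorem 2. We begin by proving that exponential separation holds under the V-uniform ergodicity assumption.

Proposition 15. Assume (A1) and (A6'). For any $\theta\not\sim\theta^*$ with $p^{\lambda}(Y_0^{r_\theta};\theta)>0$ $\bar{\mathbb P}_{\theta^*}$-a.s. and probability measure $\nu$ such that $\nu(V_\theta)<\infty$, we have $\mathbb P^{\nu,Y}_{\theta}\dashv\bar{\mathbb P}^{Y}_{\theta^*}$.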



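Proof. Fix $\theta\not\sim\theta^*$. As $\bar{\mathbb P}^{Y}_{\theta}\ne\bar{\mathbb P}^{Y}_{\theta^*}$ by assumption (A6'), there exists an integer $s\ge0$ and a bounded measurable function $h:\mathsf Y^{s+1}\to\mathbb R$ such that $\bar{\mathbb E}_{\theta}[h(Y_0^s)]=0$ and $\bar{\mathbb E}_{\theta^*}[h(Y_0^s)]=1$. Define for $n>s$ the set $A_n\in\mathcal Y^{\otimes(n+1)}$ as

$$A_n\stackrel{\mathrm{def}}{=}\Big\{y_0^n:\frac1n\sum_{i=1}^{n-s}h(y_i^{i+s})\ge\frac12\Big\}.$$

As $(Y_k)_{k\ge0}$ is stationary and ergodic under $\bar{\mathbb P}_{\theta^*}$ by (A1), Birkhoff's ergodic theorem gives $\bar{\mathbb P}^{Y}_{\theta^*}(A_n)\to1$ as $n\to\infty$. On the other hand, Theorem 14 shows that $\limsup_{n\to\infty}n^{-1}\log\mathbb P^{\nu,Y}_{\theta}(A_n)<0$. Thus, we have established $\mathbb P^{\nu,Y}_{\theta}\dashv\bar{\mathbb P}^{Y}_{\theta^*}$. □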
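
The mechanism in this proof can be seen in a much simpler i.i.d. setting, which is an assumption of the sketch below and not the hidden Markov setting of the proposition. Suppose the observations are Bernoulli with success probability `p` under $\theta$ and `p_star` under $\theta^*$ (hypothetical values), and take $h(y)=(y-p)/(p^*-p)$, so that $h$ has mean 0 under $\theta$ and mean 1 under $\theta^*$. The event that the empirical mean of $h$ is at least $1/2$ is then the event that the empirical frequency is at least $(p+p^*)/2$. Its probability can be computed exactly from the binomial distribution: it tends to one under $\theta^*$ and decays exponentially under $\theta$, at a rate no slower than the Chernoff exponent.

```python
import math


def binom_tail(n, p, k_min):
    """Exact P(Bin(n, p) >= k_min)."""
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1)
    )


def kl_bernoulli(a, p):
    """Relative entropy between Bernoulli(a) and Bernoulli(p)."""
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))


def separation_table(p, p_star, sizes):
    """For each n, return (n, n^{-1} log P_theta(A_n), P_theta_star(A_n))."""
    a = (p + p_star) / 2  # threshold on the empirical frequency
    rows = []
    for n in sizes:
        k_min = math.ceil(n * a)
        rows.append(
            (n, math.log(binom_tail(n, p, k_min)) / n, binom_tail(n, p_star, k_min))
        )
    return rows


# Hypothetical parameters: p under theta, p_star under theta*.
for row in separation_table(0.2, 0.8, [20, 80, 320]):
    print(row)
```

The second column stays below a fixed negative constant while the third approaches one, which is the pattern that exponential separation requires of the sets $A_n$.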

Proposition 15 is not sufficient to establish (A6), however: the problem is that we are interested in the case where $\nu$ is not a probability measure, but the $\sigma$-finite measure $\lambda$. What remains is to reduce this problem to an application of Proposition 15. To this end, we will use the following lemma.



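Lemma 16. Assume (A4), and fix $\theta\not\sim\theta^*$ such that $p^{\lambda}(Y_0^{r_\theta};\theta)>0$ $\bar{\mathbb P}_{\theta^*}$-a.s. For any $B\in\mathcal Y^{\otimes(r_\theta+1)}$ such that $\bar{\mathbb P}_{\theta^*}(Y_0^{r_\theta}\in B)>0$, define the probability measure

$$\lambda_{B,\theta}(A)=\bar{\mathbb E}_{\theta^*}\left[\int\mathbf 1_A(x_{r_\theta+1})\,\frac{p^{\lambda}(dx_{r_\theta+1},Y_0^{r_\theta};\theta)}{p^{\lambda}(Y_0^{r_\theta};\theta)}\,\Bigg|\,Y_0^{r_\theta}\in B\right]$$

on $(\mathsf X,\mathcal X)$. Then we have

$$\tilde{\mathbb P}^{\lambda}_{\theta}(Y_0^{r_\theta}\in B,\ Y_{r_\theta+1}^{n}\in A)=\bar{\mathbb P}_{\theta^*}(Y_0^{r_\theta}\in B)\,\mathbb P^{\lambda_{B,\theta}}_{\theta}(Y_0^{n-r_\theta-1}\in A)$$

for any set $A\in\mathcal Y^{\otimes(n-r_\theta)}$.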

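Proof. Note that by assumption (A4), $\tilde{\mathbb P}^{\lambda}_{\theta}$ is well defined (as shown in Lemma 11) and $0<p^{\lambda}(Y_0^{r_\theta};\theta)<\infty$ $\bar{\mathbb P}_{\theta^*}$-a.s. Moreover, as $p^{\lambda}(y_0^{r_\theta};\theta)=\int p^{\lambda}(dx_{r_\theta+1},y_0^{r_\theta};\theta)$, we find that $\lambda_{B,\theta}$ is indeed a probability measure on $(\mathsf X,\mathcal X)$.

Let $B\in\mathcal Y^{\otimes(r_\theta+1)}$ be such that $\bar{\mathbb P}_{\theta^*}(Y_0^{r_\theta}\in B)>0$. Then for any $n>r_\theta$

$$\begin{aligned}
\tilde{\mathbb P}^{\lambda}_{\theta}(Y_0^{r_\theta}\in B,\ Y_{r_\theta+1}^{n}\in A)
&=\int\mathbf 1_A(y_{r_\theta+1}^{n})\,\mathbf 1_B(y_0^{r_\theta})\,p^{\lambda}(y_0^n;\theta)\,\frac{\bar p(y_0^{r_\theta};\theta^*)}{p^{\lambda}(y_0^{r_\theta};\theta)}\,dy_0^n\\
&=\bar{\mathbb P}_{\theta^*}(Y_0^{r_\theta}\in B)\int\mathbf 1_A(y_{r_\theta+1}^{n})\int\lambda_{B,\theta}(dx_{r_\theta+1})\,p^{x_{r_\theta+1}}(y_{r_\theta+1}^{n};\theta)\,dy_{r_\theta+1}^{n}\\
&=\bar{\mathbb P}_{\theta^*}(Y_0^{r_\theta}\in B)\,\mathbb P^{\lambda_{B,\theta}}_{\theta}(Y_0^{n-r_\theta-1}\in A),
\end{aligned}$$

where we used $p^{\lambda}(y_0^n;\theta)=\int p^{\lambda}(dx_{m+1},y_0^m;\theta)\,p^{x_{m+1}}(y_{m+1}^{n};\theta)$ for $n>m$. □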

We can now complete the proof of Theorem 2. 

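Proof of Theorem 2. Fix $\theta\not\sim\theta^*$ such that $p^{\lambda}(Y_0^{r_\theta};\theta)>0$ $\bar{\mathbb P}_{\theta^*}$-a.s., and define

$$B=\left\{y_0^{r_\theta}:\int V_\theta(x_{r_\theta+1})\,\frac{p^{\lambda}(dx_{r_\theta+1},y_0^{r_\theta};\theta)}{p^{\lambda}(y_0^{r_\theta};\theta)}\le K\right\}.$$

By (A6'), we can choose $K$ sufficiently large so that $\bar{\mathbb P}_{\theta^*}(Y_0^{r_\theta}\in B)>0$. Consequently $\lambda_{B,\theta}(V_\theta)\le K<\infty$ by construction. As in the proof of Proposition 15, it follows that there exists a sequence of sets $A_n\in\mathcal Y^{\otimes(n-r_\theta)}$ such that

$$\lim_{n\to\infty}\bar{\mathbb P}_{\theta^*}(Y_0^{n-r_\theta-1}\in A_n)=1,\qquad\limsup_{n\to\infty}n^{-1}\log\mathbb P^{\lambda_{B,\theta}}_{\theta}(Y_0^{n-r_\theta-1}\in A_n)<0.$$

Define the sets

$$\tilde A_n=\{y_0^n:\ y_0^{r_\theta}\in B,\ y_{r_\theta+1}^{n}\in A_n\}.$$

Using the stationarity of $\bar{\mathbb P}_{\theta^*}$ and Lemma 16, it follows that

$$\lim_{n\to\infty}\bar{\mathbb P}_{\theta^*}(Y_0^n\in\tilde A_n)=\bar{\mathbb P}_{\theta^*}(Y_0^{r_\theta}\in B)>0,\qquad\limsup_{n\to\infty}n^{-1}\log\tilde{\mathbb P}^{\lambda}_{\theta}(Y_0^n\in\tilde A_n)<0.$$

This establishes (A6). □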

Finally, let us prove Proposition 3. 

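Proof of Proposition 3. Fix $\theta\not\sim\theta^*$ such that $p^{\lambda}(Y_0^{r_\theta};\theta)>0$ $\bar{\mathbb P}_{\theta^*}$-a.s., and let $B=\mathsf Y^{r_\theta+1}$. By (A6''), there exists a sequence of sets $A_n\in\mathcal Y^{\otimes(n-r_\theta)}$ such that

$$\liminf_{n\to\infty}\bar{\mathbb P}_{\theta^*}(Y_0^{n-r_\theta-1}\in A_n)>0,\qquad\limsup_{n\to\infty}n^{-1}\log\mathbb P^{\lambda_{B,\theta}}_{\theta}(Y_0^{n-r_\theta-1}\in A_n)<0.$$

Assumption (A6) now follows easily from the stationarity of $\bar{\mathbb P}_{\theta^*}$ and Lemma 16. □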



5.2. An Azuma-Hoeffding inequality. This section is somewhat independent of the remainder of the paper. We will prove a general Azuma-Hoeffding type large deviations inequality for V-uniformly ergodic Markov chains, on which the proof of Theorem 14 will be based (see Section 5.3). The following result may be seen as an extension of the Azuma-Hoeffding inequality obtained in [16] for uniformly ergodic Markov chains, and the proof of our result is similar to the proof of the Bernstein-type inequality in [1], Theorem 6.



Theorem 17. Let $(X_k)_{k\ge0}$ be a Markov chain in $(\mathsf X,\mathcal X)$ with transition kernel $Q$ and initial measure $\eta$ under the probability measure $\mathbb P^{\eta}$. Assume that the transition kernel $Q$ is V-uniformly ergodic, and denote by $\pi$ its unique invariant measure. Then there exists a constant $K$ such that

$$\mathbb P^{\eta}\left[\Big|\sum_{i=1}^{n}\{f(X_i)-\pi(f)\}\Big|\ge t\right]\le K\eta(V)\exp\left[-\frac1K\Big(\frac{t^2}{n|f|_\infty^2}\wedge\frac{t}{|f|_\infty}\Big)\right]$$

for any probability measure $\eta$, bounded function $f:\mathsf X\to\mathbb R$, and $t>0$.

Remark 13. The exponential bound of Theorem 17 has a Bernstein-type tail, unlike the usual Azuma-Hoeffding bound. However, unlike the Bernstein inequality, the tail behavior is determined only by $|f|_\infty$, and not by the variance of $f$. We therefore still refer to this inequality as an Azuma-Hoeffding bound. It is shown in [1] by means of a counterexample that V-uniformly ergodic Markov chains do not admit, in general, a Bernstein bound of the type available for independent random variables (the bound in [1] depends on the variance at the cost of an extra logarithmic factor, which precludes its use for our purposes).
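
As a numerical illustration of the two regimes in the bound of Theorem 17, read as $K\eta(V)\exp[-K^{-1}(t^2/(n|f|_\infty^2)\wedge t/|f|_\infty)]$, the sketch below evaluates this right-hand side. The constant $K$ is not explicit in the theorem, so the values of `K` and `eta_V` used here are placeholders chosen only for illustration, and the function names are hypothetical. The quadratic term is the smaller of the two exactly when $t\le n|f|_\infty$.

```python
import math


def tail_bound(t, n, f_sup, K, eta_V):
    """K * eta(V) * exp(-(1/K) * min(t^2 / (n * f_sup^2), t / f_sup))."""
    quadratic = t * t / (n * f_sup**2)  # Gaussian-type regime
    linear = t / f_sup                  # exponential-type regime
    return K * eta_V * math.exp(-min(quadratic, linear) / K)


def active_regime(t, n, f_sup):
    """Report which term attains the minimum in the exponent."""
    return "quadratic" if t <= n * f_sup else "linear"


# Placeholder constants: K and eta_V are NOT given by the theorem.
K, eta_V, n, f_sup = 2.0, 1.0, 100, 1.0
for t in (10.0, 50.0, 150.0):
    print(t, active_regime(t, n, f_sup), tail_bound(t, n, f_sup, K, eta_V))
```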

Throughout this section, we let $(X_k)_{k\ge0}$ be as in Theorem 17. For simplicity, we work with a generic constant $K$ which may change from line to line.

Before we turn to the proof of Theorem 17, let us recall some standard facts from the theory of V-uniformly ergodic Markov chains. It is well known ([27], Chapter 16), that V-uniform ergodicity in the sense of Definition 1 implies (and is essentially equivalent to) the following properties:

Minorization condition. There exist a set $C\in\mathcal X$, an integer $m$, a probability measure $\nu$ on $(\mathsf X,\mathcal X)$ and a constant $\varepsilon>0$ such that

$$Q^m(x,A)\ge\varepsilon\nu(A)\quad\text{for all }x\in C\text{ and all }A\in\mathcal X. \tag{17}$$

Foster-Lyapunov drift condition. There exists a measurable function $V:\mathsf X\to[1,\infty)$, $\lambda\in[0,1)$, and $b<\infty$, such that $\sup_{x\in C}V(x)<\infty$ and

$$QV(x)\le\lambda V(x)+b\mathbf 1_C(x)\quad\text{for all }x\in\mathsf X. \tag{18}$$

The set $C$ in the minorization condition is referred to as a $(\nu,m)$-small set (see [27] for extensive discussion). For future reference, let us note that

$$1\le\pi(V)=(1-\lambda)^{-1}\pi(QV-\lambda V)\le(1-\lambda)^{-1}b\,\pi(C)<\infty,$$

which shows that $\pi(V)<\infty$ and $\pi(C)>0$. Moreover,

$$\varepsilon\,\pi(C)\,\nu(V)\le\pi(Q^mV)=\pi(V)<\infty,$$

so that necessarily $\nu(V)<\infty$ also.
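
For a concrete instance of the drift condition (18), consider the Gaussian autoregression $X_{n+1}=aX_n+\sigma\varepsilon_{n+1}$ with $|a|<1$ and standard normal noise. This model, the choice $V(x)=1+x^2$, and the constants below are assumptions made for the sketch and are not taken from the paper. Here $QV(x)=1+a^2x^2+\sigma^2$ in closed form, and one admissible choice is $\lambda=(1+a^2)/2$, $b=1+\sigma^2-\lambda$ and $C=\{|x|\le c\}$ with $c^2=b/(\lambda-a^2)$. The code checks (18) on a grid; it does not check the minorization condition (17).

```python
import math


def drift_constants(a, sigma):
    """One admissible (lambda, b, c) for V(x) = 1 + x^2, assuming |a| < 1."""
    lam = (1 + a * a) / 2
    b = 1 + sigma**2 - lam            # value of QV - lam * V at x = 0
    c = math.sqrt(b / (lam - a * a))  # outside [-c, c], QV <= lam * V
    return lam, b, c


def V(x):
    return 1 + x * x


def QV(x, a, sigma):
    """E[V(a * x + sigma * eps)] for standard normal eps, in closed form."""
    return 1 + a * a * x * x + sigma**2


def drift_holds(x, a, sigma, tol=1e-12):
    """Check QV(x) <= lam * V(x) + b * 1_C(x), up to rounding error."""
    lam, b, c = drift_constants(a, sigma)
    indicator = 1.0 if abs(x) <= c else 0.0
    return QV(x, a, sigma) <= lam * V(x) + b * indicator + tol


a, sigma = 0.9, 1.0  # hypothetical parameter values
print(drift_constants(a, sigma))
print(all(drift_holds(i / 10, a, sigma) for i in range(-200, 201)))
```

The indicator term is needed only near the origin, where $QV$ exceeds $\lambda V$; far from the origin the contraction $a^2<\lambda$ dominates.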

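The proof of Theorem 17 is based on an embedding of the Markov chain into a wide sense regenerative process ([20], page 360), known as a splitting construction. Let us recall how this can be done. We will employ the canonical process $\check X_n=(X_n,d_n)$ on the enlarged measure space $(\check\Omega,\check{\mathcal F})$, where $\check\Omega=(\mathsf X\times\{0,1\})^{\mathbb N}$ and $\check{\mathcal F}$ is the corresponding Borel $\sigma$-field. In words, $X_n$ takes values in $(\mathsf X,\mathcal X)$ and $d_n$ is a binary random variable. Define the following stopping times:

$$\sigma_0=\inf\{n\ge0:X_n\in C\},\qquad\sigma_{i+1}=\inf\{n\ge\sigma_i+m:X_n\in C\}.$$

We now construct a probability measure $\check{\mathbb P}^{\eta}$ on $(\check\Omega,\check{\mathcal F})$ with the following properties (e.g., by means of the Ionescu-Tulcea theorem):

$(d_n)_{n\ge0}$ are i.i.d. with $\check{\mathbb P}^{\eta}(d_n=1)=\varepsilon$,

$X_0$ is independent from $(d_n)_{n\ge0}$ and $\check{\mathbb P}^{\eta}(X_0\in\cdot)=\eta$,

$\check{\mathbb P}^{\eta}(X_{n+1}\in\cdot\,|\,\check X_0^n)=Q(X_n,\cdot)$ on $\{n<\sigma_0\}\cup\bigcup_{i\ge0}\{\sigma_i+m\le n<\sigma_{i+1}\}$,

$$\check{\mathbb P}^{\eta}(X_{\sigma_i+1}^{\sigma_i+m}\in\cdot\,|\,\check X_0^{\sigma_i})=\begin{cases}\int q^{X_{\sigma_i},x}(\cdot)\,\nu(dx),&\text{if }d_{\sigma_i}=1,\\[2pt]\int q^{X_{\sigma_i},x}(\cdot)\,R(X_{\sigma_i},dx),&\text{if }d_{\sigma_i}=0.\end{cases}$$

Here we defined the transition kernel $R(x,A)=(1-\varepsilon)^{-1}\{Q^m(x,A)-\varepsilon\nu(A)\}$ for $x\in C$, and (using that $\mathsf X$ is Polish to ensure existence) the regular conditional probability $q^{X_0,X_m}(A)\stackrel{\mathrm{def}}{=}\mathbb P^{\eta}(X_1^m\in A\,|\,X_0,X_m)$.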
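
On a finite state space the objects in this construction can be written down explicitly. The sketch below takes $m=1$ and a small transition matrix, both assumptions made for illustration, builds $\varepsilon$ and $\nu$ from the column minima of $Q$ over $C$ (one possible choice, not the only one), and forms the residual kernel $R(x,\cdot)=(1-\varepsilon)^{-1}\{Q(x,\cdot)-\varepsilon\nu\}$. It then prints the pieces so that one can confirm that each $R(x,\cdot)$ is a probability vector and that $Q(x,\cdot)=\varepsilon\nu+(1-\varepsilon)R(x,\cdot)$ on $C$, which is the mixture that the coin flips $d_n$ implement. The function names are hypothetical.

```python
def minorize(Q, C):
    """Return (eps, nu) with Q[x][y] >= eps * nu[y] for all x in C (m = 1)."""
    d = len(Q[0])
    col_min = [min(Q[x][y] for x in C) for y in range(d)]
    eps = sum(col_min)
    if not 0 < eps < 1:
        raise ValueError("need 0 < eps < 1 for the residual kernel")
    nu = [v / eps for v in col_min]
    return eps, nu


def residual_kernel(Q, C, eps, nu):
    """R(x, .) = (Q(x, .) - eps * nu) / (1 - eps) for x in C."""
    return {
        x: [(Q[x][y] - eps * nu[y]) / (1 - eps) for y in range(len(nu))]
        for x in C
    }


# Hypothetical three-state chain; here C is the whole state space.
Q = [[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.3, 0.2, 0.5]]
C = [0, 1, 2]
eps, nu = minorize(Q, C)
R = residual_kernel(Q, C, eps, nu)
print(eps, nu)
print(R)
```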

The process $(\check X_n)_{n\ge0}$ is not necessarily Markov. However, it is easily verified that the law of the process $(X_n)_{n\ge0}$ under $\check{\mathbb P}^{\eta}$ is the same as the law of $(X_n)_{n\ge0}$ under $\mathbb P^{\eta}$, so that our original Markov chain is indeed embedded in this construction. Moreover, at every time $\sigma_n$ such that additionally $d_{\sigma_n}=1$, we have by construction that $X_{\sigma_n+m}$ is drawn independently from the distribution $\nu$, that is, the process regenerates in $m$ steps. Let us define the regeneration times as

$$\check\sigma_0\stackrel{\mathrm{def}}{=}\inf\{\sigma_i+m:i\ge0,\ d_{\sigma_i}=1\},\qquad\check\sigma_{n+1}\stackrel{\mathrm{def}}{=}\inf\{\sigma_i+m:\sigma_i\ge\check\sigma_n,\ d_{\sigma_i}=1\}.$$

The regenerations will allow us to split the path of the process into one-dependent blocks, to which we can apply classical large deviations bounds for independent random variables. We formalize this as the following lemma.



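Lemma 18. Define for $i\ge0$ the block sums

$$\xi_i\stackrel{\mathrm{def}}{=}\sum_{k=\check\sigma_i}^{\check\sigma_{i+1}-1}\{f(X_k)-\pi(f)\}.$$

Then $(\xi_i)_{i\ge0}$ are identically distributed, one-dependent, and $\check{\mathbb E}^{\eta}(\xi_0)=0$.

Proof. First, we note that $\check{\mathbb P}^{\eta}(\check X_{\check\sigma_i}^{\check\sigma_{i+1}-1}\in\cdot\,|\,\check X_0^{\check\sigma_i-m})=\check{\mathbb P}^{\nu}(\check X_0^{\check\sigma_0-1}\in\cdot)$ for all $i$. It follows directly that $(\xi_i)_{i\ge0}$ are identically distributed and one-dependent. Moreover, as $\check\sigma_i$ is $\sigma\{\check X_0^{\check\sigma_i-m}\}$-measurable, we find that the inter-regeneration times $(\check\sigma_{i+1}-\check\sigma_i)_{i\ge0}$ are independent. Now note that, by the law of large numbers,

$$\check{\mathbb E}^{\eta}(\xi_0)=\lim_{n\to\infty}\frac1n\sum_{i=1}^{n}\xi_i=\lim_{n\to\infty}\frac1n\sum_{k=\check\sigma_1}^{\check\sigma_{n+1}-1}\{f(X_k)-\pi(f)\}.$$

But $\lim_{n\to\infty}\frac1n\sum_{i=1}^{n}\{\check\sigma_{i+1}-\check\sigma_i\}=\check{\mathbb E}^{\eta}(\check\sigma_1-\check\sigma_0)<\infty$ by the law of large numbers and (19) below, while $\lim_{n\to\infty}\frac1n\sum_{k=1}^{n}\{f(X_k)-\pi(f)\}=0$ by the ergodic theorem for Markov chains. This completes the proof. □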

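In the proof of Theorem 17, we will need the fact that the inter-regeneration times $\check\sigma_0$ and $\check\sigma_{i+1}-\check\sigma_i$ possess exponential moments. We presently establish that this is necessarily the case, adapting the proof of [29], Theorem 2.1.

Proposition 19. There exists a constant $K$ such that

$$\check{\mathbb E}^{\eta}[\exp(\check\sigma_0/K)]\le K\eta(V)\quad\text{and}\quad\check{\mathbb E}^{\eta}[\exp(\{\check\sigma_1-\check\sigma_0\}/K)]\le K \tag{19}$$

for every probability measure $\eta$.

Proof. We begin by writing

$$\{\check\sigma_0-m=n\}=\bigcup_{j\ge0}\{d_{\sigma_0},\ldots,d_{\sigma_{j-1}}=0,\ d_{\sigma_j}=1,\ \sigma_j=n\}.$$

Using the independence of $d_{\sigma_j}$ from $d_{\sigma_0},\ldots,d_{\sigma_{j-1}},\sigma_j$, we have

$$\check{\mathbb P}^{\eta}(\check\sigma_0-m=n)=\sum_{j=0}^{\infty}\varepsilon(1-\varepsilon)^j\,\check{\mathbb P}^{\eta}(\sigma_j=n\,|\,d_{\sigma_0},\ldots,d_{\sigma_{j-1}}=0).$$

In particular, we can write

$$\check{\mathbb E}^{\eta}(e^{\check\sigma_0/K})=e^{m/K}\sum_{j=0}^{\infty}\varepsilon(1-\varepsilon)^j\,\check{\mathbb E}^{\eta}(e^{\sigma_j/K}\,|\,d_{\sigma_0},\ldots,d_{\sigma_{j-1}}=0).$$

Now note that by construction, we have

$$\check{\mathbb P}^{\eta}(X_{\sigma_j+m}\in\cdot\,|\,\check X_0^{\sigma_j})=R(X_{\sigma_j},\cdot)\quad\text{on }\{d_{\sigma_j}=0\}.$$

Define $G(K)\stackrel{\mathrm{def}}{=}\sup_{x\in C}\mathbb E^{R(x,\cdot)}(e^{\sigma_0/K})$. It is now easily established that

$$\check{\mathbb E}^{\eta}(e^{\sigma_j/K}\,|\,d_{\sigma_0},\ldots,d_{\sigma_{j-1}}=0)\le e^{jm/K}\,G(K)^j\,\mathbb E^{\eta}(e^{\sigma_0/K}).$$

We can therefore estimate

$$\check{\mathbb E}^{\eta}(e^{\check\sigma_0/K})\le\frac{\varepsilon\,e^{m/K}\,\mathbb E^{\eta}(e^{\sigma_0/K})}{1-(1-\varepsilon)e^{m/K}G(K)},$$

provided that $(1-\varepsilon)e^{m/K}G(K)<1$.

Now note that it follows from [27], Theorem 15.2.5, that

$$\mathbb E^{x}(e^{\sigma_0/K})\le K\{\lambda V(x)+b\mathbf 1_C(x)\}\quad\text{for all }x\in\mathsf X, \tag{20}$$

provided $K$ is chosen sufficiently large. Therefore, it is easily established that $\mathbb E^{\eta}(e^{\sigma_0/K})\le K\eta(V)$ for $K$ sufficiently large. On the other hand, by Jensen's inequality, $G(K)\le G(\beta)^{\beta/K}$ for $\beta\le K$. As $G(\beta)<\infty$ for some $\beta$ by (20), we have $G(K)\to1$ as $K\to\infty$. Thus, $(1-\varepsilon)e^{m/K}G(K)<1$ for $K$ sufficiently large, and we have proved $\check{\mathbb E}^{\eta}[\exp(\check\sigma_0/K)]\le K\eta(V)$. To complete the proof, it suffices to note that $\check{\mathbb E}^{\eta}[\exp(\{\check\sigma_1-\check\sigma_0\}/K)]=\check{\mathbb E}^{\nu}[\exp(\check\sigma_0/K)]$ and $\nu(V)<\infty$. □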



With these preliminaries out of the way, we now prove Theorem 17. 



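Proof of Theorem 17. Define the sequence $(\xi_k)_{k\ge0}$ as in Lemma 18. We begin by splitting the sum $S_n\stackrel{\mathrm{def}}{=}\sum_{i=1}^{n}\{f(X_i)-\pi(f)\}$ into three different terms:

$$S_n=\sum_{j=1}^{\check\sigma_0\wedge n-1}\{f(X_j)-\pi(f)\}+\sum_{k=0}^{i(n)-1}\xi_k+\sum_{j=l(n)\wedge n}^{n}\{f(X_j)-\pi(f)\}, \tag{21}$$

where $i(n)\stackrel{\mathrm{def}}{=}\sum_{k=1}^{\infty}\mathbf 1_{\{\check\sigma_k\le n\}}$ and $l(n)\stackrel{\mathrm{def}}{=}\check\sigma_{i(n)}$. Using (19), we have for $t>0$

$$\begin{aligned}
\check{\mathbb P}^{\eta}\left[\Big|\sum_{j=1}^{\check\sigma_0\wedge n-1}\{f(X_j)-\pi(f)\}\Big|\ge t\right]
&\le\check{\mathbb P}^{\eta}\big[\check\sigma_0\ge t/2|f|_\infty\big]\\
&\le\check{\mathbb E}^{\eta}[\exp(\check\sigma_0/K)]\exp\left[-\frac{t}{2K|f|_\infty}\right]\\
&\le K\eta(V)\exp\left[-\frac{t}{2K|f|_\infty}\right].
\end{aligned}\tag{22}$$

This bounds the first term of (21). To bound the last term of (21), we proceed as in the proof of [1], Lemma 3. First note that, for any $t\ge1$,

$$\begin{aligned}
\check{\mathbb P}^{\eta}[n-l(n)\wedge n+1\ge t]&=\check{\mathbb P}^{\eta}[l(n)\le n+1-t]\\
&=\sum_{\ell=0}^{n}\check{\mathbb P}^{\eta}[\check\sigma_\ell\le n+1-t,\ i(n)=\ell]\\
&=\sum_{\ell=0}^{n}\check{\mathbb P}^{\eta}[\check\sigma_\ell\le n+1-t,\ \check\sigma_{\ell+1}>n].
\end{aligned}$$

Recall that the inter-regeneration time $\check\sigma_{\ell+1}-\check\sigma_\ell$ is independent from $\check\sigma_0,\ldots,\check\sigma_\ell$, and $(\check\sigma_{i+1}-\check\sigma_i)_{i\ge0}$ are identically distributed (see the proof of Lemma 18). Thus,

$$\begin{aligned}
\check{\mathbb P}^{\eta}[\check\sigma_\ell\le n+1-t,\ \check\sigma_{\ell+1}>n]&=\sum_{k=0}^{\lfloor n+1-t\rfloor}\check{\mathbb P}^{\eta}[\check\sigma_\ell=k,\ \check\sigma_{\ell+1}-\check\sigma_\ell>n-k]\\
&=\sum_{k=0}^{\lfloor n+1-t\rfloor}\check{\mathbb P}^{\eta}[\check\sigma_\ell=k]\,\check{\mathbb P}^{\eta}[\check\sigma_1-\check\sigma_0>n-k].
\end{aligned}$$

But as $\check\sigma_\ell<\check\sigma_{\ell+1}$ for all $\ell\ge0$, we have $\sum_{\ell=0}^{\infty}\check{\mathbb P}^{\eta}[\check\sigma_\ell=k]\le1$ for all $k$, so that

$$\begin{aligned}
\check{\mathbb P}^{\eta}[n-l(n)\wedge n+1\ge t]&\le\sum_{k=0}^{\lfloor n+1-t\rfloor}\check{\mathbb P}^{\eta}[\check\sigma_1-\check\sigma_0>n-k]\\
&\le\sum_{k=\lceil t-1\rceil}^{\infty}\check{\mathbb P}^{\eta}[\check\sigma_1-\check\sigma_0>k]\\
&\le\sum_{k=\lceil t-1\rceil}^{\infty}Ke^{-k/K}\le\frac{Ke^{1/K}}{1-e^{-1/K}}\,e^{-t/K},
\end{aligned}$$

where we have used (19). We therefore find that for $t\ge2|f|_\infty$

$$\check{\mathbb P}^{\eta}\left[\Big|\sum_{j=l(n)\wedge n}^{n}\{f(X_j)-\pi(f)\}\Big|\ge t\right]\le\check{\mathbb P}^{\eta}\big[n-l(n)\wedge n+1\ge t/2|f|_\infty\big]\le K\exp\left[-\frac{t}{2K|f|_\infty}\right] \tag{23}$$

(recall that the constant $K$ changes from line to line). But we may clearly choose $K$ sufficiently large that $Ke^{-1/K}\ge1$, so that (23) holds for any $t>0$.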
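It remains to bound the middle term in (21). As $i(n)\le n$, we can estimate

$$\Big|\sum_{k=0}^{i(n)-1}\xi_k\Big|\le\max_{0\le j\le\lfloor n/2\rfloor}\Big|\sum_{k=0}^{j}\xi_{2k}\Big|+\max_{0\le j\le\lfloor n/2\rfloor}\Big|\sum_{k=0}^{j}\xi_{2k+1}\Big|.$$

Both terms on the right-hand side of this expression are identically distributed. We can therefore estimate using Etemadi's inequality ([5], Theorem 22.5),

$$\check{\mathbb P}^{\eta}\left[\Big|\sum_{k=0}^{i(n)-1}\xi_k\Big|\ge t\right]\le8\max_{0\le j\le\lfloor n/2\rfloor}\check{\mathbb P}^{\eta}\left[\Big|\sum_{k=0}^{j}\xi_{2k}\Big|\ge t/8\right].$$

Note that $|\xi_k|\le2|f|_\infty(\check\sigma_{k+1}-\check\sigma_k)$, so that using (19)

$$(2K|f|_\infty)^2\,\check{\mathbb E}^{\eta}\left[e^{|\xi_k|/2K|f|_\infty}-1-\frac{|\xi_k|}{2K|f|_\infty}\right]\le4K^3|f|_\infty^2.$$

Using Bernstein's inequality ([30], Lemma 2.2.11), we obtain

$$\check{\mathbb P}^{\eta}\left[\Big|\sum_{k=0}^{j}\xi_{2k}\Big|\ge t\right]\le2\exp\left[-\frac{t^2}{K\{(j+1)|f|_\infty^2+t|f|_\infty\}}\right].$$

We can therefore estimate for $t>0$

$$\check{\mathbb P}^{\eta}\left[\Big|\sum_{k=0}^{i(n)-1}\xi_k\Big|\ge t\right]\le K\exp\left[-\frac{t^2}{K\{n|f|_\infty^2+t|f|_\infty\}}\right]. \tag{24}$$

The proof is completed by combining (22), (23) and (24). □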

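5.3. Proof of Theorem 14. Assume without loss of generality that $\bar{\mathbb E}_{\theta}[f(Y_0^s)]=0$. To prove the result, it suffices to bound each term in the decomposition

$$\sum_{i=1}^{n}f(Y_i^{i+s})=\sum_{j=0}^{s}\Big(\sum_{i=1}^{n}\xi_{i,j}\Big)+\sum_{i=1}^{n}\mathbb E^{\nu}_{\theta}\big(f(Y_i^{i+s})\,\big|\,X_0^{i-1},Y_0^{i-1}\big),$$

where we have defined for any $0\le j\le s$ and $i\ge1$

$$\xi_{i,j}\stackrel{\mathrm{def}}{=}\mathbb E^{\nu}_{\theta}\big(f(Y_i^{i+s})\,\big|\,X_0^{i+j},Y_0^{i+j}\big)-\mathbb E^{\nu}_{\theta}\big(f(Y_i^{i+s})\,\big|\,X_0^{i+j-1},Y_0^{i+j-1}\big).$$

By construction, $(\xi_{i,j})_{1\le i\le n}$ are martingale increments for each $j$, and $|\xi_{i,j}|_\infty\le2|f|_\infty$. Therefore, by the Azuma-Hoeffding inequality ([32], page 237), we have

$$\mathbb P^{\nu}_{\theta}\left[\Big|\sum_{i=1}^{n}\xi_{i,j}\Big|\ge t\right]\le2\exp\left[-\frac{t^2}{8n|f|_\infty^2}\right]$$

for each $0\le j\le s$. On the other hand, note that $\mathbb E^{\nu}_{\theta}(f(Y_i^{i+s})\,|\,X_0^{i-1},Y_0^{i-1})=F(X_{i-1})$ for all $i$, where $F$ satisfies $\pi_\theta(F)=0$ (as we assumed $\bar{\mathbb E}_{\theta}[f(Y_0^s)]=0$) and $|F|_\infty\le|f|_\infty$. The result therefore follows by applying Theorem 17.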